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Abstract 

This paper describes a Choleski method used to solve linear system , 
of equations that arise in large-scale structural analyses . The method 
uses a novel variable-band storage scheme and is structured to exploit 
fast local memory caches while minimizing data access delays between 
main memory and vector registers. Several parallel implementations 
of this method are described for the CRAY-2 and CRAY Y-MP com- 
puters demonstrating the use of microtasking and autotasking direc- 
tives. A portable parallel language, FORCE \ is used for comparison with 
the microtasked and autotasked implementations. Results are presented 
comparing the matrix factorization times for three representative struc- 
tural analysis problems from runs made in both dedicated and multiuser 
modes on both the CRAY-2 and CRAY Y-MP computers. Both CPU 
(central processing unit) times and wall clock times are given for the 
parallel implementations and are compared with single-processor times 
of the same algorithm. Computation rates over 1 GIGAFLOP (1 billion 
floating point operations per second) on a four-processor CRAY-2 and 
over 2 GIGAFLOPS on an eight-processor CRAY Y-MP are demon- 
strated as measured by wall clock time in a dedicated environment. Re- 
duced wall clock times for the parallel implementations relative to the 
single-processor implementation of the same Choleski algorithm are also 
demonstrated for runs made in a multiuser environment. 


1. Introduction 

The computational cost of computerized struc- 
tural analysis is often dominated by the cost of solv- 
ing a very large system of algebraic equations associ- 
ated with a finite element model. This linear system 
of equations has the form 

Ku = f (1) 

where K is the symmetric positive-definite stiffness 
matrix, f is the load vector, and u is the vector of 
generalized displacements. The linear systems can 
be as large as several hundred thousand degrees of 
freedom and often require significant computing re- 
sources, in terms of both memory and execution time. 
The development of algorithms to solve these linear 
systems on existing and future high-performance su- 
percomputers that have multiple parallel high-speed 
vector CPU’s (central processing units) can signifi- 
cantly reduce the computer time required for struc- 
tural analysis. Efficient algorithms that exploit both 
vectorization and parallelization gain the maximum 
benefit on such supercomputers. 

One widely used method for solving equation (1) 
is the Choleski method (ref. 1). In the Choleski 
method the stiffness matrix K is factored into the 
product of the triangular matrices LL^, followed by 


a forward-and-backward solution using the triangular 
systems 

Lz = f (forward) (2a) 

L^u = z (backward) (2b) 

The factorization of K, which is by far the most com- 
putationally expensive part of the Choleski method, 
can be carried out in many w r ays. The particular 
implementation of the Choleski method that is the 
most effective for a class of problems depends on both 
the structure of the linear systems and the architec- 
tural features of the computer used for the analy- 
sis. Though the structure of K for many structural 
analysis applications is initially very sparse, usually 
20 to 50 nonzero coefficients in a given row of the 
matrix, the factored matrix L contains many more 
coefficients produced during the factorization step. 
For many structural analysis problems, a reordering 
of the nodes in the finite element model can be car- 
ried out that minimizes the bandwidth of the fac- 
tored matrix, thereby significantly reducing matrix 
storage requirements and the number of computa- 
tions required. A novel variable-band matrix storage 
scheme is used with the Choleski implementations 
described in this paper to reduce the computational 
cost of the factorization. The reverse Cuthill-McKee 



reordering algorithm (ref. 2) is used to minimize the 
bandwidth of the linear systems. 

In this paper a vectorized variable-band Choleski 
method, denoted as VBAND, is described for single- 
processor implementation. A parallel Choleski equa- 
tion solver, based on the VBAND method and 
denoted as PVBAND, is also presented. Several 
implementations of the PVBAND method are de- 
scribed demonstrating three different approaches for 
implementing parallel algorithms on CRAY multi- 
processor supercomputers (manufactured by Cray 
Research, Inc.). The performance of the parallel 
implementations is demonstrated by comparing fac- 
torization times for several large representative struc- 
tural analysis problems. In a dedicated comput- 
ing environment, maximum computation rates over 

1 GIGAFLOP (1 billion floating point operations 
per second) on a four-processor CRAY-2 and over 

2 GIGAFLOPS on an eight-processor CRAY Y-MP 
are achieved for the matrix factorization portion of 
the computerized structural analyses. In a multiuser 
environment, reduced wall clock times are demon- 
strated for the parallel methods as compared with 
the wall clock times for the single-processor VBAND 
Choleski algorithm. 

The outline of this paper is as follows: In sec- 
tion 2, architectural features of the CRAY multipro- 
cessor supercomputers are described. Key hardware 
featiires of the CRAY-2 and CRAY Y-MP computers 
are compared and contrasted, and implementation 
issues related to exploiting vectorization as well as 
parallel processing on CR AY supercomputers are dis- 
cussed. CRAY multitasking alternatives (macrot ask- 
ing, microtasking, and autotasking) are introduced. 
In section 3, both single-processor and parallel im- 
plementations of the variable-band Choleski method 
are described. Detailed discussions of the differences 
between the parallel implementations on the CRAY 
computers are included. The implementation details 
provide an aid for understanding the differences in 
performance across the parallel implementations. In 
section 4, the role of equation solvers in computer- 
ized structural analysis is briefly discussed. Three 
representative structural analysis problems that are 
used to compare the parallel implementations of the 
variable-band Choleski method are described. In sec- 
tion 5, numerical results are presented from multiple 
runs in both dedicated and multiuser environments, 
and in section 6, concluding remarks are given. Fi- 
nally, an appendix is given containing FORTRAN 
listings for several of the subroutines described in this 
paper. 


2. Parallel- Vector Computer 
Architectures 

The experiments described in this report were 
performed on a CRAY- 2 at the NASA Langley 
Research Center (LaRC) and a CRAY Y-MP at 
the NASA Ames Research Center (ARC). This sec- 
tion briefly describes the key architectural features 
of these computers including a discussion of the 
major hardware differences of the CRAY-2 and 
CRAY Y-MP machines that affect algorithm de- 
sign. In addition, optimization issues related to 
exploiting both the vector and parallel processing 
capabilities of CRAY supercomputers are briefly 
introduced. Differences between macrotasking, 
microtasking, and autotasking that affect algorithm 
performance are discussed at the end of the parallel 
processing subsection. 

CRAY Hardware 

The CRAY-2 and the CRAY Y-MP are shared- 
memory multiprocessor computers that have a max- 
imum of four and eight central processing units 
(CPU’s), respectively. Each CPU has multiple vector 
arithmetic and logic functional units that access very 
large main memories through eight high-speed vector 
registers. The maximum computation rate, com- 
monly measured in units known as MFLOPS (mil- 
lions of floating point operations per second), is at- 
tained when both addition and multiplication vector 
functional units are operating simultaneously, thus 
producing two results every machine clock period. 
The CRAY-2 has a clock period of 4.1 nanoseconds 
(nsec) that results in a maximum theoretical rate of 
488 MFLOPS per processor. The CRAY Y-MP used 
for the experiments in this paper has a clock period of 
6.0 nsec that results in a maximum theoretical rate of 
333 MFLOPS per processor. These theoretical peak 
rates are seldom achieved because of several factors, 
two of which are discussed next. 

Memory access delays . One factor that re- 
duces the computation rate is memory access delay. 
Some delay is always incurred if the operands for 
a vector addition or multiplication must be trans- 
ferred from main memory to the vector registers. As 
soon as the elements are in the vector registers they 
can be accessed at the maximum computation rath. 
Both the CRAY Y-MP and the CRAY-2 have 128 
million words of main memory arranged in banks. 
Since memory locations in the same bank cannot be 
accessed in successive clock periods, contiguous array 
elements are stored in an interleaved pattern across 
the banks. This storage pattern allows contiguous 
array elements to be accessed at successive clock pe- 
riods after an initial access delay. In general, the 
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does not have chaining, the memory references for 
a given set of operands must be completed before 
computations can begin. This problem is greatly 
reduced for FORTRAN DO loops containing many 
addition and multiplication instructions and several 
different operands since memory references on one 
set of operands can be carried out while multiplica- 
tions or additions are taking place on another set of 
operands. One technique used to increase the ratio of 
computations to memory references is referred to as 
loop unrolling. In this technique, nested loops con- 
taining single-vector instructions, such as SAXPY’s 
or inner products, are combined within a single loop 
containing several vector computations. This tech- 
nique allows vectors to remain in the vector registers 
longer before results are stored back to main memory, 
thus resulting in a more efficient use of the multiple 
functional units. 

Parallel Processing 

In addition to the hardware features of the 
CRAY-2 and CRAY Y-MP that improve perfor- 
mance on a single CPU, parallel processing is avail- 
able. The individual CPU’s share a large main mem- 
ory, and synchronization is accomplished using both 
the shared-memory and special hardware semaphore 
registers. Because current CRAY computers have 
small numbers of very powerful processors, the best 
overall speedups are obtained by parallelizing highly 
vectorized codes. The speedup from parallel process- 
ing is always less than p, where p is the number of 
processors, whereas the speedup due to vectorization 
often exceeds 20 or more on a single processor for 
many codes. Three types of multitasking are avail- 
able on the CRAY computers and are described be- 
low. The PVBAND method was implemented using 
all three types of multitasking to determine the best 
approach for both multiuser and dedicated modes. 
Details of the parallel implementations are given in 
section 3. 

Macrotasking . The earliest form of parallel pro- 
cessing developed for CRAY computers was macro- 
tasking. Macrotasking is intended to be used with 
large codes running in a dedicated environment. 
Macrotasked codes perform best when separate par- 
allel tasks, requiring little synchronization, are de 1 
fined at the subroutine level. Macrotasking is the 
most difficult form of multitasking used on CRAY 
machines. Since parallelism must be defined at a 
higher level, macrotasking typically requires more 
code changes and data analysis than microtasking 
and autotasking. Parallelism is defined through the 
insertion of library subroutines, making the code un- 


portable to other machines. (See refs. 3 and 4 for 
detailed discussions on macrotasking.) 

Macrotasking requires extensive task manage- 
ment that is handled by the library scheduler, a col- 
lection of library subroutines. Parallel tasks , inde- 
pendent portions of work, are initiated through calls 
to the TSKSTART and are synchronized by events , 
locks , or the TSKWAIT subroutine. TSKSTART cre- 
ates the logical CPU’s, or processes , if they have not 
been created previously. When a task changes state,' 
the library scheduler places the task in the appropri- 
ate library queue. For example, the scheduler peri- 
odically checks the Ready queue for tasks that can be 
attached to logical CPU’s. Attachment to a logical 
CPU designates a task as schedulable for execution 
on a physical CPU. On the other hand, if a task must 
wait for another task to be completed, the library 
scheduler disconnects the waiting task from the log- 
ical CPU and places it in the Suspended queue. In 
addition, if no new tasks are ready, the logical CPU 
is “put to sleep,” marking that process as unschedu- 
lable. That logical CPU must be reactivated later. 

A portable parallel language, FORCE, devel- 
oped for shared-memory parallel computers (refs. 5 
and 6) has been implemented on the CRAY com- 
puters using macrotasking. FORCE is a machine- 
independent parallel language that allows program- 
mers to implement parallel algorithms using a fixed 
syntax. Both parallel constructs (such as barriers 
and critical regions) and data types (i.e., private and 
shared variables) can be programmed without de- 
tailed knowledge of the implementation of the paral- 
lel constructs and data-type definitions on each com- 
puter. FORCE uses the UNIX sed editor and a set 
of script files to substitute machine specific code for 
each of the FORCE statements. In addition to re- 
placing FORCE statements in the user’s program, 
the FORCE preprocessing phase converts the user’s 
main program to a subroutine that is called by a stan- 
dard driver routine at run time. The driver routine 
is not seen by the user and serves to create, through 
calls to TSKSTART, the fixed number of tasks spec- 
ified by the user at run time. The entire user’s pro- 
gram is called from the main driver program by each 
task. Sequential portions of the code must be placed 
within FORCE barrier statements or other explicit 
code statements. At the end of the program, the 
FORCE driver routine terminates the tasks with the 
TSKWAIT subroutine. 

Microtasking. Microtasking was designed to 
lower the overhead for CPU synchronization and 
to allow for more efficient multitasking in a mul- 
tiuser environment. In a multiuser environment, 
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maximum transfer rate between main memory and 
the vector registers occurs when contiguous array el- 
ements are accessed from main memory. 

There are two key differences in the architectures 
of the CRAY -2 and CRAY Y-MP that affect memory 
access delays. The first is the number of paths to 
memory. On the CRAY- 2, there is only a single 
path between main memory and the vector registers. 
On the CRAY Y-MP, there are four paths to main 
memory; two can be used to read from memory, one 
to write from the registers to main memory, and 
one for input/output (I/O) operations. The second 
key difference is the use of local memory caches on 
the CRAY-2. Each processor on the CRAY-2 has 
a 16 384- word local memory cache. Local memory 
is accessible only through the vector registers, and 
only contiguous access of array elements is possible. 
However, the time delay incurred when accessing 
array elements stored in local memory is significantly 
less than that when accessing array elements stored 
in main memory. As a result, vectors that are 
used repeatedly in computations may be stored in 
local memory in order to significantly reduce memory 
access delays. 

Floating point operations and chaining. 
Another key factor that affects the computation rate 
is the type of operations performed to carry out an 
algorithm. For example, vector SAXPY operations 
(single precision a x X + Y) are efficient on CRAY 
computers since they contain both addition and mul- 
tiplication instructions that can be carried out nearly 
simultaneously. However, inner-product instructions 
( Sum = XiVi) are somewhat less efficient since 

they require a summation of many vector elements 
into a single scalar value. 

An additional feature that improves the perfor- 
mance of floating point operations on the CRAY 
Y-MP is the chaining of operations and memory ac- 
cesses. Chaining occurs whenever vector instructions 
on a set of operands are allowed to overlap, thus 
reducing the start-up overhead associated with in- 
dividual vector instructions. For example, on the 
CRAY Y-MP, a vector add of 2 arrays multiplied by 
a constant begins by issuing a load instruction for the 
first 64 elements of the first array that is followed, on 
the very next clock period, by a second load instruc- 
tion for the first 64 elements of the second array. As 
soon as the first operands of the two arrays arc in 
the vector registers, the addition operation begins. 
The multiplication operation begins as soon as the 
first result is available from the add functional unit. 
Finally, the store instruction to transfer the first 64 
results back to main memory is issued as soon as the 


first results are available from the multiply functional 
unit. If the length of the arrays is n, then this exam- 
ple requires O(n) clock periods on the CRAY Y-MP. 
On the CRAY-2, the same example would require 
0(3n) clock periods to load the two vector operands 
and store the result using the single path to memory 
and an additional 0(2n) clock periods to complete 
the addition and multiplication instructions. 

Although functional unit chaining, or tailgat- 
ing, is available on some CRAY-2 machines, most 
CRAY- 2 computers have no chaining capability. 
However, the CRAY-2 computers can benefit from si- 
multaneous memory access and computations if these 
operations take place on different sets of operands. 
Loops that contain several addition and multipli- 
cation operations with several independent sets of 
operands are often able to benefit from simultaneous 
memory access and computations. In general, loops 
that execute at high vector speeds on the CRAY-2 
also perform well on the CRAY Y-MP, but the con- 
verse is not necessarily true. 

Vector Optimization 

In order to achieve high computation rates on 
vector computers, algorithms must be designed to ex- 
ploit the key hardware features discussed above. The 
challenge to the software developer is to design algo- 
rithms that minimize the memory access delays while 
utilizing the full computing power of multiple vector 
functional units. Three key vector optimization is- 
sues axe vector length, vector stride, and the ratio 
of memory references to computations. Long vector 
operations (i.e., operations on hundreds or even thou- 
sands of array elements in a single loop) are desirable 
since the cost due to loop overhead and initial mem- 
ory access is amortized over a larger number of com- 
putations. Vector stride refers to the memory access 
pattern for arrays. Stride one vector operations occur 
when arrays are accessed contiguously. This occurs, 
for example, in FORTRAN DO loops where the first 
dimension of an array reference is incremented by one 
for each loop iteration. Although much improvement 
has been made on vector computers in recent years 
to improve memory access for strides other than one, 
stride one vector operations are still faster on both 
the CRAY-2 and CRAY Y-MP. On the CRAY-2, the 
arrangement of memory banks into quadrants makes 
even stride accesses particularly bad, especially when 
the stride is a multiple of four. 

The ratio of computations to memory references 
for vector computations is the third key issue on vec- 
tor optimization. This ratio is particularly important 
for the CRAY-2 since there is only one path between 
memory and vector registers. Because the CRAY-2 
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microt asked codes dynamically adjust to the avail- 
able resources, thus making efficient use of proces- 
sors that are available for short periods of time. 
Since smaller grained parallelism can be efficiently 
exploited using microtasking, parallelism is typically 
defined at the DO loop level. Outermost loops 
are usually parallelized, whereas inner loops contain 
vector computations. Microtasking is implemented 
through the use of compiler directives that appear 
as comments to other compilers, thus maintaining 
portability to other computers. Discussions of mi- 
crotasking can be found in references 3 and 4. 

Microtasking is designed for multiuser environ- 
ments and does not require the task management 
used by macrotasking. In microtasking, additional 
processes referred to as slave processes are created at 
the beginning of the program and immediately enter 
a library subroutine called PARK where they remain 
until summoned by the master process. Meanwhile, 
the master process executes all serial code outside 
the microtasked subroutines. Upon entering a mi- 
crotasked subroutine, the master process signals the 
slaves to enter the parallel region. If a slave process 
is active, i.e., if the process is executing on a physi- 
cal CPU, the slave process enters the parallel region. 
The code within microtasked subroutines is executed 
by all active processes unless it is contained within 
control structures. Within a control structure, par- 
allel work is distributed to active processes, and all 
work within a control structure must be completed 
before any CPU can proceed past that point. A 
unique number associated with each control structure 
as well as the number of active processes within each 
control structure are stored in special shared vari- 
ables. These variables are updated by several special 
library routines that are called from the user’s pro- 
gram. The calls to these routines are automatically 
inserted by the microtasking preprocessor at com- 
pilation time. All variables in COMMON blocks, 
SAVE statements, DATA statements, or in the ar- 
gument list of a microtasked subroutine are shared 
variables and can be accessed by all CPU’s. All other 
variables, such as loop indices, are private variables, 
and each CPU accesses a separate copy. 

Autotasking . Autot asking is the most recent 
parallel programming tool available to programmers 
on CRAY computers. (See ref. 7 for a complete de- 
scription of autotasking.) Autotasking expands the 
features of microtasking by adding the capability to 
automatically detect some forms of parallelism at the 
DO loop level and by improving the efficiency of cer- 
tain parallel constructs. One major difference be- 
tween microtasking and autotasking is the definition 


of parallel regions. In microtasking, the parallel re- 
gion extends to the subroutine boundaries; whereas 
in autotasking, multiple parallel regions may exist 
within a subroutine. The master process executes all 
code outside the parallel regions and must wait for 
the slave processes to finish before proceeding past 
the boundary of a parallel region. Compiler direc- 
tives, similar to microtasking directives, may be used 
to specify parallel regions wfithin the code. In this 
paper, the automatic parallelizing feature of auto- 
tasking was not used because it could not detect the 
parallelism in VBAND. Expanded capabilities for au- 
tomatic parallelization are planned in future releases 
of the compiling system. 

Multiuser and dedicated environments * The 
maximum benefit of parallel processing on the CRAY 
computers to a single user is experienced in dedi- 
cated mode. On dedicated systems, parallel work is 
executed on independent CPU’s and all the CPU’s 
are available for a single user. However, users also 
benefit from parallel processing in multiuser batch 
modes in many cases. On multiuser systems, multi- 
tasked programs typically receive more of the avail- 
able computing resources than a single-processor im- 
plementation of the same code. Multiuser computing 
environments for the CRAY-2 and CRAY Y-MP de- 
fine multitasking runs in terms of parallel work that 
may or may not run concurrently on independent 
CPU’s. In macrotasking, a fixed number of tasks are 
explicitly defined, and the synchronization of these 
parallel tasks can often result in significant delays on 
heavily loaded systems. On the other hand, micro- 
tasking and autotasking, rather than defining a fixed 
number of tasks, distribute the work only to the pro- 
cesses that are active at execution time. Those active 
processes are synchronized at the end of each con- 
trol structure and at the end of each parallel region. 
In this paper, both multiuser and dedicated modes 
of operation are discussed with results presented for 
both modes. 

3. Description of Methods 

This section briefly discusses the variable-band 
data structure and follows with descriptions of the 
sequential vectorized VBAND Choleski method and 
the parallel P VBAND Choleski method. The im- 
plementations of the sequential VBAND method il- 
lustrate the techniques used to obtain computation 
rates of greater than 200 MFLOPS on the CRAY- 2 
and CRAY Y-MP. The description of the PVBAND 
method illustrates the general approach used to par- 
allelize the VBAND method. Details of the three par- 
allel implementations are discussed in a separate sub- 
section. The details presented for the macrotasking, 
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microtasking, and autotasking implementations arc 
essential to understanding the differences in perfor- 
mance for the parallel implementations. Some results 
are given for the smallest example problem, the High- 
Speed Civil Transport (HSCT) aircraft discussed in 
section 4, to compare several implementations of the 
VBAND and PVBAND methods. 

Variable-Band Data Structure 

The variable-band data structure is described in 
references 8 and 9. The lower triangular part of the 
symmetric input stiffness matrix K in equation (1) is 
stored by columns, beginning with the main diagonal 
down to the last nonzero entry in each column, in- 
cluding zeros. Two important adjustments are made 
to the lengths of each column of K to account for fill- 
in during factorization and to ensure that columns 


in groups corresponding to the level of loop unrolling 
end in the same row. This data storage scheme is 
different from traditional skyline or profile Choleski 
methods that store the upper triangular part of K 
by columns beginning with the main diagonal and 
storing all coefficients up to the first nonzero coeffi- 
cient in each column. The skyline scheme is equiv- 
alent to storing the lower triangular part of K by 
rows, whereas the variable-band scheme stores the 
lower triangular part of K by columns. The advan- 
tage of the skyline scheme is that it does not require 
additional adjustments for fill-in and, in some cases, 
requires considerably less storage than the variable- 
band storage scheme. The big disadvantage of the 
skyline scheme is that the main computations in the 
associated Choleski methods are inner products that 
are much slower than the SAXPY operations used by 
the VBAND method on vector computers. 


Variable- band storage: 

Lower triangular part by columns 
from diagonal down to last nonzero 
coefficient in each column, with 
additional adjustments for fill 
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Skyline storage: 

Upper triangular part by columns 
from diagonal up to last nonzero 
coefficient in each column 


(a) Symmetric matrix. 


3 ,1 

a 2 ,1 

0 

0 

a 5,1 

a 2 . 2 

a ,, 

32 

0 

0 

a ,, 

3,3 

^.3 

0 

^.4 

^,4 

^ 5,5 


* * * ♦ * 


(b) Storage of variable- band scheme in single-dimensioned array. 


a „ 

^ 2,2 

a , 2 

a 0 „ 

3,3 

a 2 ,3 

a 4,4 

a 3,4 

3 5,5 

a 4.5 

0 

0 

^ 1,5 


♦ 4 * ♦ ♦ 


(c) Storage of skyline scheme in single-dimensioned array. 


Figure 1. Example of variable-band and skyline matrix storage schemes. In parts (b) and (c), arrows indicate pointers to starting 
coefficient in each column; zeros indicate memory locations allocated for fill elements during factorization. 
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Variable-band example . An example of the 
variable-band storage scheme and skyline storage 
scheme is given in figure 1 . The symmetric input 
matrix has nonzero coefficients in the positions shown 
in figure 1 (a). The position of each coefficient within 
a single-dimensioned array is shown for each storage 
scheme in figures 1 (b) and 1 (c). Auxiliary arrays 
store the starting location of each diagonal element 
in the matrix, denoted in figure 1 by arrows. The 
zeros in figure 1 represent zero coefficients that are 
stored by both schemes to allow for possible fill-in 
coefficients. During factorization the coefficients of 
K are modified, and many of the zero coefficients are 
replaced with nonzero values. The factored matrix L 
overwrites K. In the skyline storage scheme, no space 
is allocated for elements < 2^3 and a 14 in the first 
row of the upper triangular part of the input matrix 
since they lie above the last nonzero coefficient in 
columns 3 and 4, respectively. However, in the 
variable-band scheme, the corresponding elements of 
column 1 of the krwer triangular part of the input 
matrix, a^\ and lie between nonzero coefficients 
in column 1 of L and extra storage is allocated. 
Additional adjustments are made, if necessary, to 
the column lengths in both the skyline and variable- 
band storage schemes if loop unrolling is added to 
the factorization implementation. In general, loop 
unrolling requires that groups of columns end in the 
same row. For example, if loop unrolling to level 3 is 
used, then columns are arranged in groups of three 
and extra zero coefficients are allocated as required to 
ensure that each group of columns ends in the same 
row. In practice, the adjustment for loop unrolling 
increases the overall matrix storage requirement very 
little. 

Matrix assembly . The specialized variable- 
band data structure used for the Choleski methods 
described in this paper requires some interface with 
the matrix assembly process when used in an exist- 
ing finite element code. Matrix assembly routines 
can be written to directly assemble finite element 
stiffness matrices into the variable bandwidth form, 
but this approach may not be desirable in realistic 
applications. The variable-band storage scheme is 
required only for factorization. For portions of the 
analysis computations, which may require matrix- 
vector multiplications or the addition of two ma- 
trices, the variable-band storage scheme would be 
very inefficient because it would require a great many 
"additional operations. This is also true for very 
large problems with high bandwidths, where iterative 
solvers may be used in place of direct Choleski meth- 
ods. For these reasons, a more practical approach 
is to assemble the stiffness matrices into a general 


sparse matrix format that can be quickly converted 
to the variable-band format when the variable-band 
Choleski solver is to be used. 

The conversion from a general sparse matrix stor- 
age scheme, in which the lower triangular nonzero co- 
efficients are stored by columns, to the variable-band 
data storage scheme can be carried out efficiently as 
follows: The lengths of each column in the variable- 
band matrix are initially determined from the row 
indices stored for each nonzero coefficient. Then, 
the row lengths are adjusted to account for fill and 
loop unrolling as previously described. Next, the row 
index pointers for each nonzero coefficient are con- 
verted to location pointers for each coefficient within 
the variable-band matrix array. Finally, the variable- 
band array is initialized to zero and the nonzero coef- 
ficients are assigned using the integer location point- 
ers. Most of the operations used to convert from 
sparse to variable-band data storage schemes vector- 
ize, and the time required to convert from sparse to 
variable-band storage is small compared with the fac- 
torization time. In the appendix, FORTR AN listings 
for two subroutines used for this conversion are given. 

Sequential VBAND Choleski Method 

This section describes a basic Choleski factor- 
ization method and several modifications that sig- 
nificantly improve the computation rate on vector 
computers. In the Choleski method a symmetric, 
positive-definite matrix K is factored into the prod- 
uct LL t . The factorization can be carried out in 
many w T ays but, in general, elements of L are com- 
puted from K a column (or row) at a time begin- 
ning with column (or row) 1 of K and proceeding 
to the last column (or row). Only the lower part of 
K is stored, using the variable-band storage format 
described above, and L, which is computed by mod- 
ifying K, is stored in the same space as the original 
matrix K. 

Choleski factorization in ijk forms. Ref- 
erence 10 describes Choleski factorization for vector 
computers in terms of the so-called ijk forms. The 
ijk forms describe Choleski factorizations in terms 
of the nesting order of three loops that compute the 
modifications of K into L. These modifications are 
computed in the innermost loop in Choleski factor- 
ization implementations and are of the form 

Kj j <— Kjj — Lj fcLfcj (3) 

The notation , used in equation ( 3 ) and here- 
after, indicates that the variable is updated and 
overwritten with a new value specified by the right- 
hand expression. The order of the three letters i, j, 
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and A indicates the order of the three loops. For ex- 
ample, traditional skyline or profile Choleski meth- 
ods are of the ijk form where the inner vectorized 
loop varies the index k in equation (3) to compute 
the inner product of rows i and j of L. The middle 
loop, denoted by j, determines that row j of L is 
used in the inner-product computation with row i of 
L to update element Kjj. After the inner-product 
update of K^j, element Ly is finished by dividing 
K ij by the diagonal element L j j. The outer loop is 
the i loop, specifying that the computations proceed 
by rows beginning with row 1 and ending with the 
last row. 

A second example, the kji form, is better suited 
for the variable-band data structure. In the basic kji 
form, the outermost loop in the factorization is the 
k loop. This means that for each A, a new column of 
L is finished, and then each j in the middle loop of 
the factorization corresponds to column j of K that 
is modified by the SAXPY operation using column 
k of L in the innermost loop. Because the primary 
computation is a vector SAXPY operation, the kji 
form is well suited for vector computers such as the 
CRAY-2 and CRAY Y-MP. An additional benefit of 
the kji form is that the Ath column of L, which is 
used for many SAXPY updates, can be stored in the 
fast local cache memory on the CRAY-2. 

Modified kji VBAND method . The VBAND 
Choleski method, described in reference 8, is a mod- 
ified form of the basic kji Choleski method. The 
VBAND method is shown in figure 2 in pseudocode 
instructions. In this method the basic kji form is 
used for the column updates in LOOP3, but the jki 
form is used to finish groups of r columns of L. The 
jki form also uses SAXPY operations, but the or- 
der of the SAXPY column updates is different. The 
combination of these two forms works as follows: 

1. jki portion: Columns A, A + 1, A Tr — 1 ofL 
are completed (where r corresponds to the level 
of loop unrolling) at each iteration of the outer 
loop A. 

2. kji portion: Appropriate columns of K are mod- 
ified using columns A, A + 1, . . . , A + r — 1 of L, 
beginning with column A + r and ending with the 
column corresponding to the last nonzero coeffi- 
cient in the groups of r columns of L. 

The jki portion corresponds to the code required 
between LOOP1 and LOOP2 in figure 2. The first 
column A, computed in the jki portion, requires only 
a vector divide. Column A + 1 is modified using the 
just-completed column A in a single SAXPY update 


L00P1 A = 1 to n - r — 1 in steps of r 
{Complete columns A, A + 1, . . . , A + r 
— 1 of L) 

lastrk = A + len(A) — 1 
L00P2 j — A + r to lastrk 

IF L L j t k+ r ,-l are not a11 
zero THEN 

{Update column j of L using 
r columns of L via L00P3) 

L00P3 i = j to lastrk 

K ij i j — * ^jyk 

~ * 'kj.fc+l * • • 
—^iyk+r - 1 * kj,fc+r— 1 
END L00P3 
END IF 
END L00P2 
END L00P1 

{Finish any remaining columns of L 
(if n is not a multiple of r)} 

Figure 2. Sequential VBAND Choleski method. 

and is then divided by the diagonal element of col- 
umn A + 1. Column A + 2 is modified next by using 
both columns A and A-f-1 (a double SAXPY loop) and 
then is finished by a vector divide. Each successive 
column uses one additional column in the update por- 
tion up to loop unrolling level r — 1. The columns are 
copied into temporary arrays that are stored in local 
memory on the CRAY- 2. On the CRAY Y-MP, the 
r columns are also copied into temporary arrays to 
reduce loop indexing overhead. The kji portion, cor- 
responding to LOOP3 in figure 2, consists entirely of 
the multiple SAXPY updates. The SAXPY updates 
are skipped if all r scalar multipliers are zero. This 
zerochecking option reduces the computation rate 
slightly but reduces overall CPU time considerably 
by eliminating many unnecessary operations. This 
feature is especially important in the variable-band 
method for problems where the increase in storage 
requirements for the variable-band scheme is large 
compared with the requirements for a skyline stor- 
age scheme. If the zero-checking option is not used 
in these cases, the number of computations for the 
VBAND method can be much greater than the num- 
ber of computations required for a skyline Choleski 
method. 

The key computation in the VBAND algorithm 
is the multiple SAXPY update loop. Over 97 per- 
cent of the computations in the factorization oc- 
cur in this loop for each of the three problems 
solved for this study. The multiple SAXPY com- 
putations are carried out efficiently on both the 
CRAY-2 and CRAY Y-MP computers because they 
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Table 1. VBAND Comparisons for High-Speed Civil Transport Problem 
[16146 equations; Maximum semibandwidth — 594; Average semibandwidth = 319] 


Method 0 

CPU time 6 
(second) 

Wall time b 
(timef) 

Rate, 

MFLOPS 

Speedup 0 

NASA Langley CRAY- 2 computer in dedicated mode 

LLl 

27.9 

28.0 

73 

1.0 

LL6 

10.8 

10.8 

188 

2.6 

LL6L 

8.5 

8.6 

238 

3.3 

LL6LZ 

6.2 

6.2 

216 

4.5 

NASA Ames CRAY Y-MP computer in dedicated mode 

LLl 

14.7 

14.7 

138 

1.0 

LL6 

8.5 

8.6 

236 

1.7 

LL6L 

7.7 

7.7 

263 

1.9 

LL6LZ 

5.6 

5.6 

242 

2.6 


“Methods: 


LL1 variable-band Choleski, single SAXPY update 
LL6 variable-band Choleski, multiple (6) SAXPY update 

LL6L variable-band Choleski, multiple (6) SAXPY update, uses temporary storage for six 
update columns (stored in local memory on CRAY-2) 

LL6LZ variable-band Choleski, multiple (6) SAXPY update, uses temporary storage, skips 
SAXPY updates if all six scalar multipliers are zero 

^CRAY FORTRAN timing routines second and timef are used for CPU and wall clock times, 
respectively, which are given in seconds. 

^Speedup is computed using wall clock times relative to LL1. 


allow overlapping of memory accesses with simulta- 
neous use of both the add and multiply functional 
units. All columns of L that are accessed in L00P3 
of figure 2 on the CRAY-2 are stored in local mem- 
ory. Various values for r were tried with this method 
and the value of 6 was found to be best in most cases. 
As the level of loop unrolling increases, the amount 
of serial computation increases and the performance 
gains level off. If r is too large, the columns may 
not fit in the local memory of the CRAY-2. In prac- 
tice, only approximately 10K of the 16K words can 
be used because the operating system and compiler 
also use the local memory. 

VBAND examples . Table 1 gives a compari- 
son of the basic kji Choleski method, denoted as LL1, 
with several implementations that add the improve- 
ments described above. Runs were made using the 
HSCT example problem, described in section 4, on 
both the CRAY- 2 and the CRAY Y-MP computers 
in dedicated mode. The speedup of the vectorized 
LL1 method relative to LLl compiled with vector- 
ization inhibited is much greater than the speedups 
shown in table 1. All methods that begin with the 


symbols LL6 use the combined jki and kji Choleski 
forms already described with the level of loop un- 
rolling set to 6. The L designation after the 6 de- 
notes the use of temporary arrays to store the six 
columns of L used at each iteration of the outermost 
loop. On the CRAY-2 these arrays are stored in local 
memory by using the compiler directive cdid$ reg- 
file. Even though the CRAY Y-MP does not have 
local memory, the use of temporary arrays for the 
six columns reduced the indexing overhead and im- 
proved the computation rate by approximately 9 per- 
cent. (Compare LL6 and LL6L for the Y-MP runs.) 
The addition of zero checking, denoted by the letter 
“Z,” decreased the computation rate somewhat but 
reduced the overall time substantially. For the High- 
Speed Civil Transport problem, the factorization re- 
quired 2.033 billion operations without zero checking 
and 1.346 billion operations with zero checking. 

A comparison of the CRAY-2 and CRAY Y-MP 
times in table 1 shows that the techniques used to 
improve performance result in a greater improve- 
ment for the CRAY-2 than for the CRAY Y-MP. 
The LLl method is nearly twice as fast on the 
CRAY Y-MP compared with the CRAY-2 initially. 
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However, the local-memory implementations LL6L 
and LL6LZ run at nearly the same rate on both 
CRAY computers. The VBAND Choleski method, 
denoted by LL6LZ in table 1, is 4.5 times faster 
than the vectorized basic kji method, LL1, on the 
CRAY-2, but it is only 2.6 times faster than the LL1 
method on the CRAY Y-MP. This difference in re- 
sults for the two CRAY computers shows that al- 
gorithms that are carefully designed to exploit key 
features of the CRAY-2 architecture also run well 
on the CRAY Y-MP, but the converse is not nec- 
essarily true. The following paragraphs describe 
the approach used to parallelize this method on the 
CRAY-2 and CRAY Y-MP. 

Parallel P VBAND Choleski Method 

The sequential VBAND Choleski method is well 
suited for parallel implementation on multiprocessor 
computers such as the CRAY- 2 and CRAY Y-MP. 
FORTRAN listings of an autot asked subroutine and 
a FORCE subroutine for the PVBAND method are 
given in the appendix. The basic PVBAND Choleski 
method is shown in figure 3 in pseudocode instruc- 
tions. The jki part of the PVBAND method, i.e., 
the completion of columns k, fc + 1, ...,fc + r — lof 
L in figure 2, must be finished before the parallel up- 
dates using those columns begin. In this paper, the 
region surrounding the jki portion of the PVBAND 
method is referred to as the barrier region. A bar- 
rier region begins at a point at which all CPU’s must 
arrive before any can proceed; then the work within 
the barrier region is completed by one CPU before 
the remaining CPU’s resume execution. The number 
of barrier regions required for the PVBAND method 
is rc/r, where n is the number of equations and r is 
the level of loop unrolling. The number of computa- 
tions required within the barrier region is very small 
(between 0.6 and 2.3 percent for the three example 
problems considered in this paper) compared with 
the total number of computations required for the 
factorization. The multiple SAXPY updates can be 
computed in parallel with no synchronization or com- 
munication required. The following two subsections 
describe general differences in the implementation of 
the barrier regions and parallel loops using FORCE 
(macrotasking), microtasking, and autotasking. A 
separate section is devoted to detailed descriptions 
of the different parallel implementations. 

Barrier regions . In general, the FORCE im- 
plementations used in the PVBAND method assume 
a fixed number of tasks and must wait for all tasks to 
arrive at a barrier region, let one task complete the 
work within the barrier, and then exit sequentially. 
An excellent discussion of barrier algorithms used in 


L00P1 k = 1 to n — r — 1 in steps of r 
Begin barrier region 

{Complete columns fc, fc+1, . k + 
r — 1 of L} 

End barrier region 

lastrk = k + len(fc) — 1 

Parallel L00P2 j = k + r to lastrk 

IF Lj fc, L j^+r-l are not all zero 

THEN ’ ’ ’ 

{Update columns j of L using r 
columns of L via L00P3} 

L00P3 i .= j to lastrk 

Kjj + — — 

1 * ■ 

1 * ■kjjfc+r—l 

END L00P3 
END IF 

END Parallel L00P2 
END L00P1 

Begin barrier region 

{Finish any remaining columns of L 
(if n is not a multiple of r)} 

End barrier region 

Figure 3. Parallel PVT AND Choleski method. 

the FORCE language is given in reference 11. On 
the other hand, microtasking and autotasking imple- 
mentations use a less restrictive form of the barrier 
synchronization. In this case, the first active pro- 
cess that reaches a barrier region completes the work 
within that region and leaves the barrier after the 
work is completed. Other active processes that arrive 
at the barrier region must wait until the work is com- 
pleted or, if the work has already been completed, 
they immediately skip the barrier region and con- 
tinue execution. Each type of barrier implementation 
requires critical regions, that is, regions where vari- 
ables in shared memory must be incremented by one 
(and only one) CPU at a time. The shared variable 
must be locked to all other CPU’s during the time 
that it is being changed by a single CPU. The cost of 
locking and unlocking variables is a major part of the 
parallel overhead in the PVBAND implementations. 


Parallel loops . The implementations of the par- 
allel loops also differ substantially for the FORCE, 
microtasking, and autotasking versions. In the 
FORCE implementation, the parallel loop is distrib- 
uted to the fixed number of tasks in a wrap-around 
fashion. That is, task 1 completes the SAXPY 
updates associated with loop indices 1, 1 + p, 1 -j- 2p, 
etc., where p is the number of tasks. This approach 
has the advantage of evenly dividing the work in cases 


10 



c **** begin top of FORCE Barrier 
call lockon(barlck) 
if (ffnbar .It. np - 1) then 
ffnbar = ffnbar + 1 
call lockof f (barlck) 
call lockon(barwit) 
endif 

if (ffnbar .eq. np - 1) then 
c **** end top of FORCE Barrier 

c **** execute code inside barrier 


c **** begin bottom of FORCE Barrier 
endif 

if (ffnbax .eq. 0) then 
call lockof f (barlck) 
else 

ffnbar = ffnbar -1 
call lockof f (barwit) 
endif 

c **** end bottom of FORCE Barrier 

Figure 4. FORTRAN code for a standard force barrier. 

where the amount of work for each task is nearly the 
same. The disadvantage of a prescheduled assign- 
ment is that a large potential delay may be incurred 
on busy systems if all tasks are not active. In the 
microtasking and autotasking implementations, the 
parallel work is distributed only to active processes. 
This approach has the advantage of dynamic alloca- 
tion of the computing resources in a multiuser envi- 
ronment. The disadvantage of this approach is that 
each CPU must determine its next loop iteration in- 
side a critical region. A self-scheduled parallel loop 
is also available in FORCE which uses the same ap- 
proach as microtasking and autotasking. However, 
the FORCE critical regions, which use macrotasking 
lock subroutines described in the next section, are 
much more expensive than the critical regions imple- 
mented using microtasking and autotasking. 

Parallel Implementation Details 

Details of the various parallel implementations of 
the PVBAND method are discussed next. These de- 
tails show that autotasking and microtasking best 
exploit the parallel architecture of the CRAY com- 
puters, particularly in multiuser environments. The 
FORTRAN code used to implement the parallel con- 
structs is not usually seen by algorithm developers 
since it is inserted automatically at compile time. 
However, to illustrate the differences in the imple- 


mentation of the barrier region and parallel loop 
using FORCE, microtasking, and autotasking, the 
actual FORTRAN statements substituted for the 
FORCE statements and various compiler directives 
are examined. These details help to explain the 
sometimes large differences in performance that are 
observed in the results for the example problems us- 
ing the same basic parallel algorithm but with differ- 
ent parallel implementations. 

FORCE implementations . The FORCE lan- 
guage is described in detail in reference 6. Details 
of the use of FORCE will not be presented here, 
but the FORCE statements Barrier } End barrier , 
and Presched DO are discussed. In FORCE, crit- 
ical regions are implemented through macrotasking 
lock subroutines. The locks are identified as shared 
integer variables. The lock subroutines do not make 
use of the fast hardware semaphore registers although 
there are macrotasking subroutines that do use the 
semaphore registers to implement certain types of 
critical regions (refs. 3 and 4). A barrier subroutine 
is also available in macrotasking but is not used by 
the current version of FORCE. 

The FORCE statements Barrier and End bar- 
rier were used to implement the barrier region in 
the PVBAND method. The Barrier and End barrier 
statements are transformed by the FORCE prepro- 
cessor into the FORTRAN code shown in figure 4. 
The code listed in figure 4 uses subroutine calls to 
lock the shared variables barlck and barwit . These 
calls ensure that only one task at a time can incre- 
ment the counting variable ffnbar and prevent tasks 
from exiting the barrier region before the work inside 
the barrier is completed. When the routine lockon is 
called, a task attempts to lock the argument variable. 
If the argument variable is already locked, the task 
incrementing the counting variable is disconnected 
from the process and placed in the Blocked on a Lock 
queue. When the argument variable becomes un- 
locked, the library scheduler will move the tasks from 
the Blocked on a Lock queue into the Ready queue. 
Tasks in the Ready queue may then be assigned to 
logical CPU’s that are executed on physical CPU’s 
as they become available. If the argument vari- 
able is not locked, the calling task locks the variable 
and continues execution. Estimates of lock overhead 
for both the CRAY-2 (ref. 3) and the CRAY Y-MP 
(ref. 4) are described below. The CRAY-2 requires 
200 clock cycles when the variable to be locked is un- 
locked, whereas the CRAY Y-MP requires 400 clock 
cycles. If the variable is locked, the CRAY-2 over- 
head is 4000 clock cycles plus the time that a task 
must wait for the variable to become unlocked. The 
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c **** begin top of alternate FORCE 
Barrier 

if (me . eq. 1) then 
ist = 2 

10 continue 

do 20 i = ist, np 

if (lcord(i) . ne. k) then 
ist = i 
goto 10 
endif 

20 continue 

c **** end top of alternate FORCE 
Barrier 


execute code inside barrier 


c **** begin bottom of alternate FORCE 
Barrier 

lcord(l) - k 
else 

30 continue 

lcord(me) = k 
if (lcord(l) .ne. k) then 
goto 30 
endif 
endif 

c **** end bottom of alternate FORCE 
Barrier 

Figure 5. FORTRAN code for an alternate FORCE barrier. 

corresponding CRAY Y-MP overhead is 1500 clock 
cycles plus the wait time. 

The FORCE barrier is implemented using two 
locks, barlck and barwit The initial state, before each 
program call to the barrier, has barlck unlocked and 
barwit locked. At the beginning of the FORCE bar- 
rier, the first task to reach the call lockon statement 
locks the variable barlck and increments the counter 
ffnbar . That task then unlocks variable barlck and 
attempts to lock the second barrier shared variable 
barwit , but since barwit is initialized in the locked 
condition, that task must wait for barwit to be un- 
locked. As soon as barlck is unlocked, another task 
can lock it, increment the counting variable, unlock 
variable barlck , and wait for barwit to become un- 
locked. When the last task arrives at the barrier 
statement (i.e., ffnbar = np 1, where np is the num- 
ber of tasks), the variable barlck is left locked and 
the FORTRAN code between the Barrier and End 
barrier statements is executed while the remaining 
np — 1 tasks are still cither suspended or trying to 


lock variable barwit When the single task finishes 
the sequential FORTRAN code, the counting vari- 
able ffnbar is decremented and barwit is unlocked. 
One of the remaining tasks now relocks barwit (which 
keeps the remaining tasks still trying to lock bar- 
wit) and continues execution by first decrementing 
the counting variable ffnbar } unlocking barwit , and 
continuing execution. The last task to lock barwit 
(i.e., when ffnbar = 0) unlocks barlck and continues 
execution past the barrier. This method ensures that 
all tasks arrive at a barrier region before the work in- 
side the barrier work is started, and no task leaves 
the barrier region until the work inside the barrier 
region is completed. 

The standard FORCE barrier described in the 
preceding paragraph is expensive on any parallel 
system where the cost of locking shared variables 
is high or where the specified number of tasks may 
not be concurrently executing. The cost of locking 
shared variables is eliminated by using an alternate 
FORCE barrier, shown in figure 5. This barrier was 
developed by Jones (ref. 12). In this barrier a shared 
array, lcord() } is initialized once outside of LOOP1 
in figure 3 by using a standard FORCE barrier. The 
shared array is then used for the barrier region within 
LOOP1 in figure 3. Each task is associated with one 
array location (i.e., 1 through p) and writes only to 
that location. At the beginning of the barrier region, 
tasks 2 through p write the value of the LOOP1 
iteration variable k to their respective locations in 
array Icord. Task 1 checks locations 2 through p in 
Icord and executes the code within the barrier region 
only after locations 2 through p in Icord are equal 
to k . Tasks 2 through p check location 1 of array 
/cord, continuing in a loop until task 1 completes 
the barrier region and writes the value of the loop 
iteration variable k to location 1 of Icord. Tasks 
are not automatically suspended while waiting for 
appropriate values in the Icord array since no calls to 
lock routines are used. However, excessive CPU time 
may accumulate when the number of tasks specified 
at run time is greater than the number actually 
running concurrently. For this reason, this barrier is 
used only on lightly loaded systems or in dedicated 
mode. This type of barrier synchronization is useful 
within a loop such as in the PVBAND method, where 
the outer-loop index k can be used for the test value 
of the Icord array. It is not a general-purpose barrier. 

The parallel loop in the PVBAND method is im- 
plemented in FORCE using the Presetted DO state- 
ment. This statement is translated into a FORTRAN 
DO loop by the FORCE preprocessor. For example, 
the FORCE statement 

Presched DO 1 I=ISTART, IEND 
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c **** begin top of process control 

structure 

cmic$ process 

cdir$ suppress (list of variables) 
if (utqbcs (-1)) then 

utqitr = utqif a(l , utqitr) 
if (utqitr . eq. -1) then 
c **** end top of process control struc- 
ture 


c **** execute code inside control 
structure 


c **** begin bottom of process control 
structure 
cmic$ end process 
end if 

cdir$ suppress (list of variables) 
call utqecs 
endif 

c **** end bottom of process control 
structure 

Figure 6. Microtasking code for the process and end process 
directives. 

is translated to the following FORTRAN DO loop: 

DO 1 I=ISTART+ME-1 , IEND ,NP 

The variable NP is the number of tasks specified 
by the user at run time, and the task identification 
variable ME is an integer between 1 and NP. This 
FORTRAN loop assigns each iteration of the DO 
loop to the specified number of tasks in a wrapped 
fashion. 

Microtasking implementations. Microtask- 
ing is accomplished by inserting compiler directives 
in FORTRAN source code to identify control struc- 
tures. The barrier regions in figure 3 are implemented 
using the cmic% process and cmic% end process direc- 
tives, and the parallel loop is implemented using the 
cmic% do global directive. The code containing the 
microtasking directives is preprocessed so that the 
FORTRAN code implementing each control struc- 
ture is inserted. In the following paragraphs, the 
code generated by the preprocessor for both control 
structures is discussed. 

The process and end process control structure di- 
rectives bound a barrier region that is to be com- 
pleted by only one CPU before any additional CPU’s 


may proceed past the barrier region. In the micro- 
tasked algorithm, one CPU completes the r columns 
of L at each outer loop iteration, and only one CPU 
finishes the leftover work at the end. Only processes 
currently active must wait at the end of the process 
control structure. This approach differs significantly 
from the FORCE barrier described in the previous 
paragraphs where all tasks must reach the barrier 
before the sequential code can begin and all tasks 
must leave the barrier sequentially to begin execut- 
ing the next section of code after the sequential code 
is completed. As soon as the sequential code is com- 
pleted, all processes may proceed to the parallel loop, 
including any additional processes that may become 
active. The FORTRAN code inserted by the micro- 
tasking preprocessor for the process and end process 
directives is discussed next. 

Figure 6 shows the FORTRAN code generated 
by premult , the microtasking preprocessor, for the 
process control structure. The preprocessor replaces 
the process directive with a comment and inserts the 
multitasking code. The cdir$ suppress directive is in- 
serted by the preprocessor at the beginning and end 
of the barrier region to force all variables referenced 
within the barrier region from registers into shared 
memory. This ensures that all CPU’s have access to 
the most current values of the shared variables. The 
subroutine utqbcs is called by each CPU to determine 
if the control structure is eligible to be processed. 
The control structure is eligible for processing if the 
control structure has not been started or if it has been 
started but not yet completed. A shared variable is 
used to indicate which control structure is currently 
active. An additional shared variable is used to up- 
date the number of active processes within the con- 
trol structure. The code within a process directive is 
executed by only one CPU, but all other CPU’s must 
wait until the sequential code is completed. For ex- 
ample, if two of four CPU’s are available, and hence 
two CPU’s attempt to execute one of the barrier re- 
gions, the first CPU to reach the barrier region will 
begin executing the code. The second CPU will call 
routine utqecs which either suspends the process tem- 
porarily or just delays the second CPU until the sin- 
gle task is completed. As soon as the barrier region is 
completed, both CPU’s proceed to the next section 
of code. If a third CPU becomes available and arrives 
at the same barrier region after the work within the 
barrier region is completed, routine utqbcs will return 
a value of false and that CPU will skip the entire bar- 
rier region and go on to the next section of code. The 
implementation of this scheme uses special shared 
and private variables, hardware semaphore registers 
to lock appropriate variables, and assembly coded 
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c **** begin top of do global control 

structure 

cmic$ do global 

utqtrips = comp®( (lastr) - (k+6)) 
if (utqbcs (utqtrips) then 
if (utqtrips . It . 0) then 

utqitr - utqif a(l ,utqitr) 
utqdummy = comp® (utqitr) 
utqoff = (lastr)+l 
j = utqoff + utqitr 
if (utqdummy .ge. 0) then 
20 continue 

c **** end top of do global control 
structure 


c **** execute code inside control 
structure 


c **** begin bottom of do global control 
structure 

utqitr = utqifa(l , utqitr) 
utqdummy = comp® (utqitr) 
j = utqoff + utqitr 
if (utqdummy .ge. 0) goto 
20 
endif 
endif 

call utqecs 
endif 

c **** end bottom of do global control 
structure 

Figure 7. Microtasking code for the do global directive. 

routines such as utqbcs , utqifa , and utqecs to effi- 
ciently carry out the necessary synchronization steps. 

The do global directive is used to implement the 
parallel loop in figure 3. Each iteration of LOOP 2, 
the multiple SAXPY updates, is assigned one at a 
time to each active process. This mode of parallel 
execution differs from the FORCE implementation 
that distributes the iterations of LOOP2 to each task 
in a predetermined sequence. The total number of 
loop iterations is computed from the limit variables in 
the FORTRAN DO loop. The particular loop index 
j executed by each CPU is determined by reading 
a shared variable within a critical region and then 
incrementing that variable. 

Figure 7 shows the code generated by premult for 
the do global control structure. CPU's that reach the 
parallel DO loop will begin execution of the loop as 
long as routine utqbcs returns a value of true . This 


means that even CPU’s that may become available 
after much of the work within the parallel region is 
completed may still be used in the parallel loop. Each 
CPU determines its current iteration value utqitr 
through calls to the routine utqifa . The routine 

utqifa maintains a shared variable used to determine 
iteration values. The shared variable does not appear 
in the FORTRAN code in figure 7. All CPU’s 
that enter the parallel DO loop continue to execute 
different loop iterations until the loop counter utqitr 
gets to zero or to a positive value. Each CPU finishes 
its last available loop iteration and then calls routine 
utqecs , which delays all CPU’s until all work in the 
parallel region is completed. 

The code displayed in figures 6 and 7 was pro- 
duced on the CRAY-2. The corresponding code gen- 
erated on the CRAY Y-MP is different because of 
hardware differences and the greater use of intrin- 
sic functions that are used to test semaphore regis- 
ters and to read from and write to shared registers. 
These differences can cause noticeable changes in per- 
formance even though the same basic strategy is fol- 
lowed on both machines in implementing the barrier 
regions and parallel loops. 

Autotasking implementations . The depen- 
dency phase f pp of the autotasking compilation was 
not able to detect the parallelism in the VBAND 
method. Therefore, directives that were functionally 
equivalent to microtasking directives were inserted 
in the FORTRAN subroutine used to carry out the 
PVBAND method. Two autotasking implementa- 
tions for the PVBAND method were explored. 

In the first autotasking implementation, LOOP2 
in the PVBAND algorithm is preceded by the cmic% 
do all directive. The do all directive defines the start 
of a parallel region, and any code outside this region 
is to be executed only by the master process. The 
advantage of this approach is simplicity (i.e., only 
a single directive is required to parallelize the entire 
method), but the disadvantage is that each outer loop 
iteration requires additional overhead to start up a 
new parallel region. 

In the second autotasking implementation, the 
entire subroutine is declared a single parallel region 
using the cmic% parallel and cmic$ end parallel di- 
rectives. This approach is identical to the microtask- 
ing implementation already discussed, but different 
directives are used. All variables within the subrou- 
tine are explicitly declared to be shared or private 
within the parallel directive. The autotasking di- 
rectives cmic% case and cmic% end case perform the 
same function as the microtasking process and end 
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process directives, whereas the cmic% do parallel di- 
rective corresponds to the microtasking do global di- 
rective. All code outside the control structures, but 
inside the parallel region, is executed by all active 
processes. In experiments, the second autotasking 
method was found to be more efficient and was used 
for all autotasking results presented in this paper. 

The autotasking translation phase, generated by 
/rap, produces code very similar to the code produced 
by premult using microtasking. The autotasked sub- 
routine has additional code generated to handle the 
creation of parallel regions. In addition, separate 
code segments are created for the master and the 
slave processes. The code generated for the auto- 
tasking do parallel directive is the same as the code 
for the microtasking do global directive. Likewise, the 
code generated for the autotasking case directive is 
the same as the code for the microtasking process di- 
rective, except for some differences related to the sup- 
press directive. In microtasking, the suppress direc- 
tive, at the beginning of the process control structure, 
forces all variables that will be read inside the control 
structure to be written from registers to memory. At 
the end of the process control structure, all variables 
that have been modified are again written to main 
memory. In autotasking, only shared variables are 
subject to the suppress directive that precedes and 
follows the case directive. 

CRAY-2 local-memory considerations . The 
local-memory version of the PVBAND method re- 
quires some additional considerations. The columns 
of L, completed inside the barrier region in the 
PVBAND method, are computed by only one CPU 
but are used by all other CPU’s. Since the CRAY-2 
and CRAY Y-MP are shared- memory machines, 
these columns are accessible to all CPU’s. How- 
ever, the fast local memory cache on the CRAY-2 
is not shared between CPU’s and so each CPU must 
copy the columns into local memory before begin- 
ning the parallel SAXPY updates. Two approaches 
were tested to minimize the extra cost of copying the 
columns of L into local memory. No noticeable dif- 
ference was observed for the two approaches in the 
microtasked and autotasked implementations. How- 
ever, for the FORCE implementations, a significant 
"difference was observed. 

The first approach was to complete all r columns 
inside a single barrier, or control structure. The CPU 
-completing the r columns copies them into its lo- 
cal memory as soon as each column of L is com- 
pleted. The remaining CPU’s copy all six columns 
in one loop as soon as the single CPU finishes the 
barrier region. The second approach was to use r 
barrier or control structures, allowing the idle CPU’s 


to copy each column into local memory as soon as 
each column is completed. This approach reduces 
the time required at the end of the barrier to copy 
the columns into local memory, but at the expense of 
more barriers and an increased number of DO loops 
for the copying operation. Because of the high cost 
of FORCE barrier statements, the second approach 
was much slower than the first. In all the results pre- 
sented in section 5 for the CRAY- 2, the first approach 
was used for the FORCE implementations, whereas 
the second approach was used for the microtasking 
and autotasking implementations. 

Amdahl’s law . The degree of parallelism for any 
method is always limited by the number of sequen- 
tial computations relative to the number of parallel 
computations. A much-used measure of the effect 
of sequential operations on the parallel efficiency of 
a given method is Amdahl’s law, wffiich states that 
the theoretical maximum speedup S possible for a 
parallel algorithm is given by 


wdiere p is the number of processors and w s is the 
percentage of time required by sequential compu- 
tations. (See ref. 13 for a thorough discussion of 
several measures of the degree of parallelism.) For 
the PVBAND method, the jki portion at each outer 
loop iteration is the sequential part of the method, 
wffiereas all remaining computations are carried out 
in parallel in LOOP3 (fig. 3). Maximum theoretical 
speedup estimates for the PVBAND method for each 
of the three example structural analysis problems are 
given in table 2 by using equation (4). The percent- 
ages used for the computations in table 2 were ob- 
tained by inserting counting variables in the barrier 
region of code in the autotasked and microtasked im- 
plementations. The maximum speedup estimations 
assume that the sequential computations are carried 
out at the same rate as the multiple SAXPY up- 
dates in LOOP3 and also that no overhead exists for 
the parallel loop and the barrier regions. The actual 
speedups attained will always be less than the esti- 
mates in table 2, but values close to these estimates 
indicate good parallel implementations. 

PVBAND examples . Table 3 gives a compar- 
ison of parallel and sequential timings for the High- 
Speed Civil Transport problem. The PVBAND im- 
plementations are denoted by appending the letters 
A, M, F, and J to the VBAND implementation sym- 
bols in table 3. The letters correspond to the au- 
totasking, the microtasking, the FORCE using the 
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Table 2. Effect of Amdahl’s Law on Maximum Theoretical Speedup for PVBAND Method 


Number of 
processors, p 

Maximum theoretical speedup for problem — 

HSCT a 

SRB 6 

3-D panel c 

2 

1.96 

1.96 

1.99 

4 

3.74 

3.75 

3.93 

8 

6.89 

6.93 

7.68 

16 

11.90 

12.03 

14.70 


a HSCT: High-Speed Civil Transport, sequential operations 2.3 percent of total operations 
for factorization. 


5 SRB: Space Shuttle solid rocket booster, sequential operations 2.2 percent of total opera- 
tions for factorization. 

r 3-D panel; Cross-ply composite laminate, sequential operations 0.59 percent of total 
operations for factorization. 


standard barrier, and the FORCE using the special- 
purpose barrier implementations of the PVBAND 
method, respectively. Runs were made on the 
CRAY-2 in dedicated mode comparing the paral- 
lel PVBAND implementations with the correspond- 
ing sequential VBAND implementations described 
previously. Parallel speedups are computed using 
the PVBAND wall clock times and the sequential 
VBAND wall clock times for the same method. 
Speedups are often computed by dividing the parallel 
CPU time by the parallel wall clock time, but such 
speedups do not accurately reflect the cost of par- 
allel processing. A comparison of the CPU time 
for each PVBAND implementation with the CPU 
time for the sequential VBAND implementation of 
the same method shows that the total CPU time is 
always greater for the parallel implementations than 
for a single-processor implementation of the same al- 
gorithm. If the CPU time for the parallel implemen- 
tations is used to compute parallel speedups, imple- 
mentations with the largest CPU time may appear to 
have the greatest parallel efficiency. For example, in 
table 3 the parallel speedup for the LL6LF1 FORCE 
implementation would be 1.9 if computed by divid- 
ing the parallel CPU time by the parallel wall clock 
time. However, the actual wall clock time for this 
implementation is greater than the wall clock time 
for the sequential VBAND implementation. 

The FORCE PVBAND implementations, which 
used the special alternate barrier described previ- 
ously, were the fastest in dedicated mode and have 
the highest parallel speedups. The faster times are 
expected since no locking of variables is required 
for the special FORCE barriers and no tasks are 


ever suspended while waiting for sequential work to 
be completed. However, the special barrier works 
effectively only in a dedicated environment. The 
FORCE implementations that used the standard bar- 
riers were the slowest of the PVBAND implementa- 
tions. The LL6LF1 implementation, which uses one 
barrier for each column computed in the jki portion 
of the PVBAND method, is very inefficient on the 
CRAY-2. The wall clock time for this method was 
greater than the single-processor time of the corre- 
sponding VBAND implementation. CPU times were 
also much higher for the FORCE implementations on 
the CRAY-2 than for the microtasking and autotask- 
ing implementations. This difference reflects the high 
cost of barriers in the FORCE implementations. The 
differences between autotasking and microtasking arc 
small, with autotasking slightly faster for these runs. 

The parallel speedups shown in table 3 are 
affected by both the number of sequential compu- 
tations and the overhead associated with the im- 
plementation of barrier regions. Table 4 shows the 
percentage of sequential operations for this problem. 
The LL1 methods have a lower number of sequential 
computations compared with the LL6 methods but 
they also require six times as many barrier regions. 
As a result, the parallel speedups in table 3 are, in 
some cases, higher for the LL6 methods than for the 
LLl methods. This is especially true for the FORCE 
implementations that use the standard FORCE bar- 
rier. The results in table 3 also show that the use 
of temporary arrays for local memory versions and 
zero-checking reduces parallel speedups. However, 
both local memory and zero checking reduce total 
wall clock time and are therefore beneficial. 
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Table 3. PVBAND Method Timings for High-Speed Civil Transport Problem 


NASA Langley CRAY-2 computer in dedicated mode; 
16 146 equations; Maximum semibandwidth = 594; 
Average semibandwidth = 319 


Method a 

CPU time 
(second) 

Wall time 
(time}) 

Speedup^ 

Single-processor VBAND Choleski method 


27.87 

27.95 



10.79 

10.82 



8.53 

8.55 


LL6LZ 

6.20 

6.22 


^ Parallel PVBAND method using autotasking j 

LL1A 


8.29 

3.37 

LL6A 


3.13 

3.46 

LL6LA 

9.91 

2.67 

3.20 

LL6LZA 

7.77 

1.96 

3.17 

Parallel PVBAND method using microtasking 

LL1M 

33.53 

8.54 

3.27 

LL6M 

12.90 

3.67 

2.95 

LL6LM 

10.06 

2.89 

2.96 

LL6LZM 

7.85 

2.15 

2.89 

Para 

lei PVBAND method using standard FORCE barrier j 

LL1F 

50.86 

17.18 

1.63 

LL6F 

14.26 

3.98 

2.72 

LL6LF1 

54.52 

28.52 


LL6LF2 

13.37 

4.25 

2.01 

LL6LZF 

11.42 

3.94 

1.58 

Parallel PVBAND method using special FORCE barrier 0 

LL1J 

30.26 

7.64 

3.66 

LL6J 

12.17 

3.06 

3.54 

LL6LJ 

9.56 

2.41 

3.55 

LL6LZJ 

7.15 

1.79 

3.47 


a VBAND method names appended by letters A, M, F, and J to denote autotasking, 
microtasking, FORCE with standard barrier, and FORCE with special alternate barrier 
implementations of PVBAND method, respectively. 

LL6LF1 - uses six barrier statements for each outer loop iteration in jki portion of 
PVBAND method 

LL6LF2 - uses only one barrier statement to finish all six columns in jki portion of 
PVBAND method 

^PVBAND speedups calculated using the PVBAND wall clock time and the corresponding 
w r all clock time for the single-processor method. 

c Special FORCE barrier avoids use of critical regions that require calls to routines that lock 
the shared variables. 


Solution of Triangular Systems 

The solution of the triangular systems is a for- 
ward solution (eq. (2a)), followed by a backward so- 
lution (eq. (2b)). As in the factorization of K, there 


are several possible implementations of the triangu- 
lar solutions. Since the factorization is more costly 
than the triangular solutions, the data structure used 
for the factorization often determines the implemen- 
tations of the forward and backward solutions. In 
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Table 4. Number of Computations and Memory Requirements for Example Problems 


Method 

Factorization 
(total operations) 

Sequential, 
percent of total 

Memory 
(64-bit words) 

High-Speed Civil Transport aircraft 

(16146 equations; Max. semibandwidth = 594; Av. semibandwidth = 319) 

LLl 

2 033084 710 

0.09 

5160501 

LL6 

2 033161281 

1.5 

5160591 

LL6LZ 

1345866 285 

2.3 

5160591 

Composite cross-ply laminate with hole 

(11929 equations; Max. semibandwidth = 1597; Av. semibandwidth = 1081) 

LL6LZ 

13 098 674109 

0.59 

12 901825 


Space Shuttle solid rocket booster 


(54 870 equations; Max. semibandwidth = 

= 558; Av. semibandwidth = 355) 

LL6LZ 

5 427998155 

2.2 

19 522 323 


the variable-band data structure, the lower trian- 
gular part of L is stored by columns. As a result, 
the forward solution is carried out using SAXPY op- 
erations. This so-called column-sweep algorithm is 
shown in figure 8(a). For the backward solution, the 
columns of L are now the rows of L 1 ; therefore, a 
different approach that uses inner products is used 
to ensure stride one memory accesses. Figure 8(b) 
illustrates this inner product algorithm. 

L00P1 j = 1 to n 

IF fj = 0 THEN skip L00P2 
z j ~ fj/kjj 
lastr — j + len(j) - 1 
L00P2 i = j + 1 to lastr 
f? ~ f) T i K j * z j 
END L00P2 

END L00P1 

(a) Column- sweep forward solution, Lz = f. 

L00P1 j = n to 1 

lastr = j + len(j) — 1 
L0DP2 i = j + 1 to lastr 

Z i — Z i L j -j ^ 11, 

End LOOP2 ’ 

U J = 

End LOOP1 

(b) Inner product backward solution, L T u — z. 

Figure 8. Forward-and-backward triangular solution 
algorithms. 

The zero-checking option used to reduce oper- 
ations in the factorization can also be applied to 
the forward solution, as shown in figure 8(a). For 
problems where the right-hand side vector has few 


nonzero values, this strategy can significantly reduce 
the operation counts for the forward solution. The 
zero-checking strategy cannot be used effectively for 
the inner products in the backward solution. The 
loop-unrolling techniques described for the factoriza- 
tion were also applied to both the forward and back- 
ward solutions. The computation rates for the tri- 
angular solutions were improved significantly on the 
CRAY-2 as a result of loop unrolling, but on the 
CRAY Y-MP the improvement was very small. This 
difference is due to the positive effect of chaining on 
the CRAY Y-MP which caused the initial triangu- 
lar solutions to be nearly two times faster than the 
CRAY-2 versions. Loop unrolling made up much 
of the difference on the CRAY-2, but the overhead 
of additional sequential computations required for 
the loop-unrolling version limited the amount of the 
improvement. A FORTRAN listing of the forward 
and backward solutions using loop unrolling and zero 
checking is given in the appendix. 

For the parallel implementations of the PVBAND 
method, the triangular solutions were carried out on 
a single processor. Times for the triangular solutions 
are given in section 5 for the three structural analysis 
problems. 

Out-of-Core Extensions f 

An out-of-core version of both the VBAND and 
PVBAND methods is under development. The out- 
of-core version permits the solution of problems 
that are too large to fit in main memory or al- 
lows users to solve larger problems interactively on 
machines where interactive memory limits are im- 
posed. The out-of-core method is designed to take 
advantage of the large memories available on today’s 
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Finite element model 

(Model size is half of sketch shown) 

2851 nodes 

5189 two-noded rod elements 
4329 four-noded quadrilateral elements 
1 14 three-noded triangular elements 



Stiffness matrix 

16 146 equations 
499 505 coefficients 

594 max. semibandwidih 
319 av. semibandwidth 


Figure 9. Finite element model of High-Speed Civil Transport aircraft. 


supercomputers. This is accomplished by using the 
buffer in and buffer out CRAY FORTRAN I/O state- 
ments and three buffers to store blocks of the factored 
matrix. The input matrix is stored in large blocks by 
records in an unformatted file. The I/O required is 
carried out using one of the three buffers in a rotat- 
ing fashion, and computations required for the fac- 
torization proceed in the remaining buffers while I/O 
operations are underway. The data are read in once 
and written out after an entire buffer is finished, and 
no movement of data in memory is required. The 
overlapping of I/O and computations in this method 
results in factorization times for the out-of-core ver- 
sion that are very close to the factorization times for 
the in-core version of the factorization. The main 
difference in wall clock times for the in-core meth- 
ods compared with the out-of-core methods occurs 
in the triangular solutions. The triangular solutions 
require two passes through the factored matrix, and 
the number of computations per coefficient is only 
two. As a result, the time required for the triangular 
solutions is increased dramatically for the out-of-core 
methods. 


4. Structural Analysis Applications 

In this paper, three representative structural 
analysis problems are used to compare the differ- 
ent parallel implementations of the variable-band 
Choleski method. The three problems represent full- 
scale problems that can all be stored in main memory 
on the CRAY-2 and CRAY Y-MP computers. The 
example structural analysis problems described in 
this work were carried out using the Computational 
MEchanics Testbed (COMET), a large-scale struc- 
tural analysis software system developed by the Com- 
putational Mechanics Branch at the LaRC (refs. 14 
and 15). Although the structural response determi- 
nation is the primary goal of structural analysis, the 
three problems considered herein are used primarily 
to assess the performance of the parallel/vector equa- 
tion solvers. Hence, only a brief description of each 
structural problem is provided. 

High-Speed Civil Transport Aircraft 

Projected trends in travel to the Pacific Basin 
have led to a renewed national interest in the 
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Finite element model 


4114 nodes 

3328 eight-noded hexahedral 

elements arranged in 1 6 layers 

Stiffness matrix 

1 1 929 equations 
397 139 coefficients 

1597 max. semibandwidth 
1081 av. semibandwidth 



Cutout view showing stacking 
of elements through thickness 


Figure 10. Finite element model of cross-ply laminate with a hole. 


development of a High-Speed Civil Transport (HSCT) 
aircraft (refs. 16 and 17). The design of such a vehicle 
will require an integrated analysis approach involv- 
ing both structures and aerodynamics. To accelerate 
the development of mathematical models of various 
structural configurations, a parameterized finite ele- 
ment model has been developed. The finite element 
model (fig. 9) used herein represents the symmet- 
ric half of the HSCT aircraft. The model involves 
2851 nodes, 5189 two-noded rod elements, 4329 four- 
noded quadrilateral elements, and 114 three-noded 
triangular elements. The linear static response is 
determined for the case of a wingtip loading. The 
linear system for this problem has 16 146 equations 
with a maximum semibandwidth of 594 and an av- 
erage semibandwidth of 319. The nodes in the finite 
element model w 7 ere ordered using the reverse Cuthill- 
McKee algorithm implemented in the COMET soft- 
ware to minimize bandwidth. 

Cross-Ply Composite Laminate With a 

Hole 

A detailed stress analysis of composite structures 
is often required to accurately determine through- 
the-thickness (or interlaminar) stress distributions. 
Some sources of interlaminar stress gradients include 
free-edge effects, holes, and ply drop-offs (e.g., ta- 


pered stiffener attachment flanges). To study these 
effects, three-dimensional finite element models are 
frequently used. To study the performance of these 
solvers for three-dimensional finite element models, 
the overall structural response of an 8-ply cross-ply 
composite laminate wfith a central circular hole was 
considered. The finite clement model (fig. 10) used 
herein has 4114 nodes and 3328 eight-noded hexa- 
hedral solid elements arranged in 16 layers. This fi- 
nite element model is adequate for overall response 
characteristics but must be refined in order to deter- 
mine the interlaminar stress state accurately. The 
linear system for this problem has 11929 equations 
with a maximum semibandwidth of 1597 and an av- 
erage semibandwidth of 1081. The larger bandwfidth 
of this problem is characteristic of three-dimensional 
models. The nodes in the finite element model were 
ordered using the reverse Cuthill-McKee algorithm 
implemented in the COMET software to minimize 
bandwidth. 

Space Shuttle Solid Rocket Booster 

A preliminary assessment of the Space Shuttle 
solid rocket booster (SRB) global shell response to se- 
lected prelaunch loads was presented in reference 18. 
The tw r o-dimensional shell finite element model used 
in this study was translated into a format compatible 
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Finite element model 

9205 nodes 

1273 two-noded beam elements 
9156 four-noded quadrilateral elements 
90 three-noded triangular elements 



Stiffness matrix 

54 870 equations 
1 310 973 coefficients 

558 max. semibandwidth 
355 av. semibandwidth 


Figure 11. Finite element model of Space Shuttle solid rocket booster. 


with the COMET software. The finite element model 
(fig. 11) involves 9205 nodes with 1273 two-noded 
beam elements, 90 three-noded triangular elements, 
and 9156 four-noded quadrilateral elements. The 
linear static response to the internal pressure load- 
ing in the SRB was obtained. The linear system for 
this problem has 54 870 equations with a maximum 
semibandwidth of 558 and an average semibandwidth 
of 355. The nodes in the finite element model were 
ordered using the bandwidth minimization options 
in PATRAN software developed by PDA Engineer- 
ing (ref. 19). The PATRAN bandwidth optimization 
reduced the bandwidth for this problem more than in 
previously reported results (refs. 8 and 9) which used 
the reverse Cuthill-McKee algorithm implemented in 
the COMET software. As a result, the number of 


operations required for the matrix factorization was 
reduced approximately 25 percent. 

5. Numerical Results and Comparisons 

In this section, timing results are presented 
for the matrix factorization stage of three struc- 
tural analysis problems comparing the sequential 
VBAND Choleski method and the parallel PVBAND 
Choleski implementations. The experiments were 
performed on the CRAY- 2 at the LaRC and on 
the CRAY Y-MP at the ARC. Results are pre- 
sented from computer runs on both the CRAY-2 
and CRAY Y-MP computers made first during dedi- 
cated single-user mode and second from runs made in 



Method 

CPU time 
(second) 

Wall time 
(time/) 

Rate, 

MFLOPS 

Speedup 

1 High-Speed Civil Transport aircraft (16 146 equations; 

| Max. semibandwidth = 

594; Av. semibandwidth = 319) 

VECTOR 

6.20 

6.22 

216 


AUTO 

7.77 

1.96 

687 

3.17 

MICRO 

7.85 

2.15 

626 

2.89 

SFORCE 

11.42 

3.94 

342 

1.58 

JFORCE 

7.15 

1.79 

752 

3.47 

Composite cross-ply laminate with hole (11 929 equations; 

j Max. semibandwidth = 

[597; Av. semibandwidth = 1081) 

VECTOR 

47.92 

48.07 

272 


AUTO 

50.52 

12.84 

1020 

3.74 

MICRO 

51.22 

13.04 

1004 

3.69 

SFORCE 

55.34 

17.57 

746 

2.74 

JFORCE 

51.27 

12.95 

1011 

3.71 

Space Shuttle solid rocket booster (54 780 equations; 

| Max. semibandwidth = 

558; Av. semibandwidth = 355) j 

VECTOR 

24.83 

24.91 

218 


AUTO 

30.25 

7.78 

698 

3.20 

MICRO 

31.07 

8.01 

678 

3.11 

SFORCE 

43.87 

15.74 

345 

1.58 

JFORCE 

29.20 

7.40 

734 

3.37 


(a) Comparison of vector and parallel implementations of 
Choleski factorization for three example problems. 


H VECTOR - no parallelization 
El AUTO - autotasking 
RS&BI MICRO - microtasking 

SFORCE - macrotasking, standard FORCE barrier 
JFORCE - macrotasking, special FORCE barrier 
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(b) Comparison of wall clock times (normalized). 


Figure 12. CRAY-2 factorization times in dedicated mode with a maximum of four CPU’s. 


normal multiuser environments. The runs made 
in dedicated mode show the maximum computa- 
tion rates possible for the three problems using the 
PVBAND method and measure parallel overhead 
without additional delays caused by other programs 
executing at the same time. The runs made in mul- 
tiuser mode measure the wall clock times for both 
the VBAND and PVBAND methods when other 
programs are executing. The results show that the 
PVBAND method is efficient on the CRAY comput- 
ers and significantly reduces wall clock time in both 
multiuser and dedicated modes. 

In the results that follow, the names VECTOR, 
AUTO, MICRO, SFORCE, and JFORCE are used 
to denote the vector implementation of the VBAND 
method and the autotasking, microtasking, and two 
FORCE implementations of the PVBAND method, 
respectively. SFORCE denotes the use of the 
FORCE Barrier statement, and JFORCE denotes 
the FORCE implementation that uses the alternate 
barrier described in section 3. All parallel methods 
compared in this section use loop unrolling to level 6 


and the zero-checking option described in section 3. 
Table 4 shows the number of computations, percent- 
age of sequential operations during factorization, and 
memory requirements for each of the example struc- 
tural analysis problems. 

Timing Considerations 

The measure of time for the comparison of algo- 
rithms is an important consideration, particularly on 
multiuser parallel computers like the CRAY comput- 
ers which offer several timing options. For sequential 
algorithms, CPU time is the most often quoted mea- 
sure of time and is usually measured on the CRAY 
computers by the FORTRAN function second ( ). For 
parallel runs, second { ) returns a cumulative CPU 
time, summed for all processes and as such provides 
no useful measure of parallel speedup. A different 
timing function, tsecnd(), may be used to measure 
individual CPU time for a given task. The CPU time 
for a routine executed by p processes may be taken as 
the largest CPU time returned for any single process. 
This routine can be used as an effective measure of 
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Method 

CPU time 
( second ) 

Wall time 
( time/) 

Rate, 

MFLOPS 

Speedup 

1 High-Speed Civil Transport aircraft (16 146 equations; j 

Max. semibandwidth = 

- 594; Av. semibandwidth = 319) J 

VECTOR 

5.50 

5.50 

245 


AUTO 

6.46 

1.62 

831 

3.40 

MICRO 

6.29 

1.57 

857 

3.50 

SFORCE 

6.98 

1.75 

769 

3.14 

JFORCE 

6.33 

1.58 

852 

3.48 

Composite cross-ply laminate with hole (1 1 929 equations; 1 

Max. semibandwidth = 

1597; Av. semibandwidth = 1081) j 

VECTOR 

46.33 

46.34 

283 


AUTO 

48.71 

12.18 

1075 

3.80 

MICRO 

48.11 

12.03 

1089 

3.85 

SFORCE 

48.69 

12.19 

1075 

3.80 

JFORCE 

49.14 

12.29 

1066 

3.77 

Space Shuttle solid rocket booster (54 780 equations; j 

j Max. semibandwidth 

= 558; Av. semibandwidth = 355) | 

VECTOR 

21.92 

21.93 

248 


AUTO 

25.56 

6.39 

849 

3.43 

MICRO 

24.83 

6.21 

874 

3.53 

SFORCE 

27.19 

6.80 

798 

3.23 

JFORCE 

25.10 

6.28 

864 

3.49 


(a) Comparison of vector and parallel implementations of 
Choleski factorization for three example problems. 


VECTOR - no parallelization 
t%"l AUTO - autotasking 
ESSSi MICRO - microtasking 

SFORCE - macrotasking, standard FORCE barrier 
ESI JFORCE - macrotasking, special FORCE barrier 





0 .25 .50 .75 1.00 

(b) Comparison of wall clock times (normalized). 


Figure 13. CRAY Y-MP factorization times in dedicated mode with a maximum of four CPU’s. 


the load balance for a given algorithm executed on a 
fixed number of processors, as in FORCE implemen- 
tations described in this paper. However, tsecnd { ) 
often does not provide an accurate measure of the 
total time required for parallel algorithms. Even on 
dedicated systems, the tsecnd{) time is often sub- 
stantially less than the wall clock time measured for 
the same process. This difference may lead to overly 
optimistic performance projections that cannot ac- 
tually be obtained even in a dedicated environment. 
The most reliable measure of actual achievable com- 
putation rates for given parallel algorithms on the 
CRAY computers is obtained using wall clock time 
(function time f) in a dedicated, single-user environ- 
ment. In this paper, only wall clock times are used 
to evaluate the performance of the parallel implemen- 
tations. Additional results are given that show the 
times reported by second ( ), and they are referred to 
as CPU time. The CPU times show in general that 
parallel runs accumulate more total CPU time than 
the corresponding CPU time for a single-processor 


run, and these times generally increase as the num- 
ber of processors increases. 

Dedicated Runs 

Dedicated, single-user runs were obtained for the 
three example problems to measure the maximum 
performance of the VBAND and PVBAND imple- 
mentations. For each run, the wall clock time 
was used to compute the MFLOP rate and paral- 
lel speedups were measured relative to the wall clock 
time for the VBAND method VECTOR. 

CRAY-2 results . Figure 12 shows CRAY-2 re- 
sults for all three structural analysis problems using 
the VBAND and PVBAND implementations. The 
JFORCE method, which uses the special FORCE 
barrier with no calls to macrotasking lock subrou- 
tines, is the fastest PVBAND implementation on two 
of the three problems. The SFORCE method, which 
uses the standard FORCE barrier, is the slowest 
PVBAND implementation, demonstrating the high 
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Method 

CPU time 
(second) 

Wall time 
(timef) 

Rate, 

MFLOPS 

Speedup 

High-Speed Civil Transport aircraft (16 146 equations; 

Max. semibandwidth 

- 594; Av. semibandwidth = 319) 

VECTOR 

5.50 

5.50 

245 


AUTO 

7.64 

.96 

1402 

5.73 

MICRO 

7.39 

.93 

1447 

5.91 

SFORCE 

10.93 

1.42 

948 

3.87 

JFORCE 

7.43 

.93 

1447 

5.91 

Composite 

cross-ply laminate with hole (1 1 929 equations; 

Max. semibandwidth = 

1597; Av. semibandwidth = 1081) 

VECTOR 

46.33 

46.34 

283 


AUTO 

51.32 

6.43 

2037 

7.21 

MICRO 

50.55 

6.34 

2066 

7.31 

SFORCE 

55.00 

7.96 

1646 

5.82 

JFORCE 

52.11 

6.52 

2008 

7.11 

Space Shuttle solid rocket booster (54 780 equations; 

1 Max. semibandwidth 

= 558; Av. semibandwidth = 355) 

VECTOR 

21.92 

21.93 

248 


AUTO 

29.69 

3.72 

1459 

5.90 

MICRO 

28.90 

3.64 

1491 

6.02 

SFORCE 

39.44 

4.95 

1097 

4.43 

JFORCE 

29.10 

3.64 

1491 

6.02 


(a) Comparison of vector and parallel implementations of 
Choleski factorization for three example problems. 


VECTOR - no parallelization 
E53 AUTO - autotasking 
MICRO - microtasking 

Egftl S FORCE - macrotasking, standard FORCE barrier 
[7v| JFORCE - macrotasking, special FORCE barrier 






(b) Comparison of wall clock times (normalized). 


Figure 14. CRAY Y-MP factorization times in dedicated mode with a maximum of eight CPU’s. 


cost of the lock subroutines used by the standard 
FORCE implementation, Over 1 GIGAFLOP was 
achieved on the largest bandwidth problem, the com- 
posite cross-ply laminate, for all but the SFORCE 
method. This problem has the lowest percentage 
of sequential operations in the barrier region and 
the longest vector lengths in the parallel update sec- 
tion. On the CRAY-2, autotasking (method AUTO) 
is slightly faster than microtasking (method MICRO) 
for all three problems. 

CRAY Y-MP results . Figures 13 and 14 

give timing results from runs on the CRAY Y-MP 
using four and eight processors in a dedicated mode. 
The eight-processor CRAY Y-MP results in figure 14 
show rates of over 1 GIGAFLOP for all three prob- 
lems and over 2 GIGAFLOPS for the cross-ply lam- 
inate problem. Parallel speedups for each of the 
problems compare favorably with the theoretical 
maximums computed in table 2. 

A comparison of the CRAY-2 results in figure 12 
and the CRAY Y-MP results in figure 13 shows that 


both the computation rates and parallel speedups are 
higher on the CRAY Y-MP than on the CRAY-2. 
These results may be due to the fact that the 
CRAY Y-MP has more hardware available than the 
CRAY-2 for multitasking purposes. There are nine 
clusters of shared registers for interprocessor com- 
munication and synchronization, each of which has 
8 shared address, 8 shared scalar, and 32 semaphore 
registers. In contrast, the CRAY-2 has only eight 
semaphore registers that synchronize common mem- 
ory references in multitasked jobs. 

The CRAY Y-MP performance of the FORCE 
implementations relative to the autotasking and mi- 
crotasking implementations is significantly improved 
over the CRAY-2 results. The FORCE times in fig- 
ure 13 from CRAY Y-MP runs using four CPU’s are 
very close to the autotasking and microtasking times. 
However, the FORCE times are noticeably slower 
than the autotasking and microtasking times as the 
number of processors is increased from four to eight 
(fig. 14). 
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On the CRAY Y-MP, the microtasked PVBAND 
implementation is slightly faster than the autotasked 
implementation. This result is reversed from the rela- 
tive performance of autotasking and microtasking on 
the CRAY-2. The PVBAND implementations on the 
CRAY Y-MP were compiled using the CFT77 ver- 
sion 4.0 compiler. Earlier results that used a previ- 
ous compiler had shown faster times for autotasking. 
The CRAY-2 compiler used for the results presented 
in this paper is CFT77 version 3.1. 

The times in figures 12-14 show that the 
PVBAND method is efficient and achieves a high per- 
centage of the maximum possible computation rate 
on both CRAY computers. The times for the Space 
Shuttle SRB problem are the lowest wall clock and 
CPU times yet achieved for this problem (refs. 8 
and 9). The rates achieved in the VBAND and 
PVBAND implementations use only FORTRAN sub- 
routines. Further improvements in the computa- 
tion rates can be achieved by writing parts of the 
PVBAND implementations in assembly code. In ad- 
dition, if the next group of r columns of L is com- 
pleted as part of the parallel update region in the 
PVBAND method at each stage, or outer loop it- 
eration, the barrier region can be eliminated. This 
change eliminates the barrier region and should im- 
prove the performance of the PVBAND method as 
the number of processors increases. These improve- 
ments are the subject of current and future studies. 

Multiuser Runs 

Although the dedicated runs indicate the maxi- 
mum benefit of parallel processing to a single user, 
the typical computing environment on multiproces- 
sor CRAY supercomputers is a multiuser mode with 
many users sharing the four to eight CPU’s. Many 
users may be logged in at any given time and may be 
interactively executing jobs, editing files, or submit- 
ting batch runs to one or more of several job queues. 
In this environment, users rarely have control of all 
the CRAY CPU’s for an entire job execution. The 
performance of multitasked jobs will differ across ma- 
chines according to the system parameters defined by 
each computing center. In order to see the effect of 
multitasking in a busy environment on the CRAY-2 
and the CRAY Y-MP, each problem was run multiple 
times using the VBAND and PVBAND implemen- 
tations. Multiple runs were submitted using script 
files during normal operating hours. The two largest 
problems were run in batch queues that are normally 
run after the normal working hours along with other 
large batch runs. The results shown in figures 15 17 
were taken from 10 runs of each problem. Both the 
CRAY-2 and CRAY Y-MP were running the CRAY 


UNICOS 5.1.9 operating system at the time of this 
study. The CRAY UNICOS 6.0 operating system in- 
corporates changes that should improve the perfor- 
mance of multitasked jobs in a batch environment. 

CRAY-2 results . Figure 15 shows the best, 
median, and worst elapsed times for all three prob- 
lems from runs on the CRAY- 2. One factor that 
significantly affects the performance of the VBAND 
and PVBAND implementations in multiuser mode 
is the memory-swapping size limit. For programs 
below the memory-swapping size limit, the entire 
matrix is swapped out of main memory more of- 
ten than for larger problems when tasks are inter- 
rupted. Larger problems are left in memory even 
though the CPU may be interrupted and dedicated 
to another user for a time period. This effect can 
be seen in figure 15 by comparing the CPU and wall 
clock times for the SRB problem, which is over the 
swapping size limit, with the HSCT and composite 
panel problems, which are under the swapping size 
limit. The best-case wall and CPU times for the 
SRB problem are nearly identical to the dedicated 
times for the VBAND method VECTOR. For the 
smaller problems, the best-case wall times were be- 
tween two and four times longer than the dedicated 
wall times for method VECTOR, indicating that the 
cost of memory swapping significantly increased the 
wall clock times. The wall clock times for the auto- 
tasking and microtasking PVBAND implementations 
indicate that these parallel implementations signifi- 
cantly reduce wall clock time when compared with 
the wall clock time for the VBAND VECTOR im- 
plementation for all three problems. The FORCE 
implementations did not reduce CPU time in the 
multiuser mode, and in most cases they significantly 
increased wall clock times relative to the VBAND 
method. The very poor performance of FORCE on 
the CRAY-2 is the result of the interaction between 
the high cost of standard FORCE barriers and the 
scheduling of a fixed number of tasks in a busy en- 
vironment. This apparently leads to more frequent 
swapping between memory and disks, thus greatly in- 
creasing the wall clock times. The multiuser times for 
the FORCE implementations on the CRAY-2 were 
as much as 35 times greater than the corresponding 
method VECTOR times in the worst case. The auto- 
tasking and microtasking implementation times were 
very close on all three problems, with the autotask- 
ing implementations giving the fastest best-case wall 
clock times on all three problems. 

CRAY Y-MP results . On the ARC CRAY 
Y-MP, the maximum number of CPU’s used for the 
single user’s program is set by default to four. Users 
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Method 

Best times: 
Wall /CPU 

Median times: 
Wall /CPU 

Worst times: 
Wall /CPU 

High-Speed Civil Transport aircraft (16 146 equations; 
Max. semibandwidth = 594; Av. semibandwidth = 319) 

VECTOR 

AUTO 

MICRO 

FORCE 

25.64/6.21 

8.92/7.06 

9.35/7.41 

51.17/13.88 

32.79/6.22 
11.51/6.97 
13.34/7.29 
253.95/ 15.10 

38.27/6.23 

14.09/6.99 

15.54/7.06 

1345.16/16.20 

Composite cross-ply laminate with hole (11 929 equations; 
Max. semibandwidth = 1597; Av. semibandwkfth = 1081) 

VECTOR 

AUTO 

MICRO 

FORCE 

98.41/47.64 

38.79/49.35 

39.23/50.03 

79.42/55.17 

145.80/47.68 

51.01/49.15 

49.42/49.85 

350.31/56.95 

357.79/47.72 

71.16/48.95 

70.93/49.87 

1318.95/54.56 

Space Shuttle solid rocket booster (54 780 equations; 
Max. semibandwidth = 558; Av. semibandwidth = 355) 

VECTOR 

AUTO 

MICRO 

FORCE 

24.90/24.87 

12.38/29.17 

12.99/29.95 

40.17/51.98 

33.14/24.86 

16.64/28.28 

18.86/29.01 

62.54/50.07 

50.86/24.85 

25.74/27.88 

24.52/28.62 

100.71/54.78 


(a) Wall clock and CPU times for Choleski factorization 
for 10 runs each on three example problems. 


Wall clock time 

M Best 
□ Median 
Worst 



I i I i I 

0 1 2 


(b) Comparison of wall clock times (normalized). 


figure 15. CRA'Y -2 factorization times in multiuser mode with a maximum of four CPU’s. 


Method 

Best times: 

Median times: 

Worst times: 


Wall /CPU 

Wall /CPU 

Wall / CPU 


High-Speed Civil Transport aircraft (16 146 equations; 


Max. semibandwidth = 

594; Av. semibandwidth = 319) 


VECTOR 

5.60/5.58 

5.62/5.60 

5.84/5.59 


AUTO 

1.64/6.54 

1.66/6.55 

1.78/6.52 


MICRO 

1.59/6.33 

1.61/6.35 

2.76/6.28 . 


FORCE 

1.77/7.05 

1.80/7.12 

2.84/7.04 


Composite cross-ply laminate with hole (11 929 equations; 


Max, semibandwidth = 

[597; Av. semibandwidth = 1081) 


VECTOR 

65.32/47.15 

96.06/47.62 

120.87/47.39 


AUTO 

28.35/48.92 

38.88/49.13 

64.76/48.78 


MICRO 

25.56/48.52 

37.84 / 47.91 

67.94/48.06 


FORCE 

26.10/50.17 

31.74/49.94 

42.71/49.84 


| Space Shuttle solid rocket booster (54 780 equations; 


Max. semibandwidth = 

= 558; Av. semibandwidth = 355) 


VECTOR 

22.15/22.10 

34.49/22.45 

58.79/22.36 

1 

AUTO 

7.62 / 25.53 

13.72/25.54 

42.30/24.46 

■■■ 

MICRO 

7.38 / 25.08 

14.99/24.86 

19.76/24.40 


FORCE 

10.26/28.99 

14.73/28.38 

28.39/28.07 



(a) Wall clock and CPU times for Choleski factorization 
for 10 runs each on three example problems. 


L 


Wall clock time 

□ Median 
Worst 





0 .5 1.0 

(b) Comparison of wall clock times (normalized). 


Figure 16. CRAY Y-MP factorization times in multiuser mode with a maximum of four CPU’s. 
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Wall clock time 


Method 

Best times: 
Wall /CPU 

Median times: 
Wall /CPU 

Worst times: 
Wall /CPU 

High-Speed Civil Transport aircraft (16 146 equations; 
Max. semibandwidth - 594; Av. semibandwidth = 319) 

VECTOR 

AUTO 

MICRO 

FORCE 

5.60/5.58 

1.04/7.57 

1.03/7.31 

1.64/11.45 

5.62/5.60 

1.31/7.33 

1.49/7.04 

2.27/11.50 

5.84/5.59 

6.21/6.86 

14.42/6.60 

13.09/9.04 

Composite cross-ply laminate with hole (1 1 929 equations; 
Max. semibandwidth = 1597; Av. semibandwidth = 1081) 

VECTOR 

AUTO 

MICRO 

FORCE 

65.32/47.15 

28.46/50.00 

19.94/49.94 

19.50/54.01 

96.06/47.62 

34.62/49.61 

39.93/48.65 

26.55/53.81 

120.87/47.39 
93.09 /49.18 
49.86/48.36 
70.46/52.99 

Space Shuttle solid rocket booster (54 780 equations; 
Max. semibandwidth = 558; Av. semibandwidth = 355) 

VECTOR 

AUTO 

MICRO 

FORCE 

22.15/22.10 

6.92/27.07 

6.46/26.87 

10.33/38.45 

34.49/22.45 
11.32/27.16 
11.86/ 26.87 
18.31/33.91 

58.79/22.36 

29.66/25.22 

26.93/25.02 

31.93/38.20 


(a) Wall clock and CPU times for Choleski factorization 
for 10 runs each on three example problems. 


iH Best 
□ Median 
Worst 
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(b) Comparison of wall clock times (normalized). 


Figure 17. CRAY Y-MP factorization times in multiuser mode with a maximum of eight CPU’s. 


can request a maximum of eight CPU’s at run time. 
Figure 16 gives results from multiuser runs for all 
three example problems using the default maximum 
number of four CPU’s, and figure 17 gives results 
for the same problems using a maximum of eight 
CPU’s. Very little improvement in wall clock time 
is seen in figure 17 compared with the results in 
figure 16. This result indicates the difficulty of 
obtaining all eight processors in a busy multiuser 
environment. The results in both figures 16 and 17 
show a substantial reduction in wall clock time by all 
the PVBAND implementations. A major difference 
between the CRAY-2 and CRAY Y-MP results is 
that the CRAY Y-MP wall clock times are much 
better than the CRAY-2 wall clock times for the 
FORCE PVBAND implementations. This difference 
between the implementation of macrotasking on the 
two CRAY computers is reflected in a comparison of 
the wall clock times in figures 15, 16, and 17 for the 
FORCE implementations. 

Triangular Solutions 

Table 5 shows the CPU times for the solution of 
the triangular systems for the three structural anal- 


ysis problems used in this study. The improvement 
in computation rate is demonstrated for the HSCT 
problem by comparing the basic triangular solution 
times LL1S with the triangular solutions using loop 
unrolling to level 6, i.e., LL6S. This improvement is 
largest on the CRAY-2, where the bottleneck due to 
the single path to memory causes single SAXPY or 
inner product computations to be much slower than 
the same operations on the CRAY Y-MP. The re- 
duction in operations due to the zero-checking strat- 
egy LL6SZ is also shown in table 5 for the HSCT 
problem. For all three problems, the percentage of 
CPU time for the triangular solutions as compared 
with the factorization time is small. The potential 
benefit from parallelizing the triangular solutions is 
small because of the small percentage of total time 
required for the triangular solutions. For analyses 
requiring multiple triangular solutions, the benefit of 
a parallel triangular solution would be far greater. 

6. Concluding Remarks 

A Choleski method that effectively exploits key 
architectural features of the CRAY- 2 and CRAY 
Y-MP computers has been described. This method 
is well suited for parallel processing, and several 
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Table 5. Forward- and-Backward Solution Times for Three Example Problems 


Problem 

Method 0 

CPU time 
(second) 

Number of 
operations 

Rate, 

MFLOPS 

Factor tinned 
percent 

NASA Langley CRAY-2 computer runs 


LL1S 

0.27 

20610072 

76 

4.3 

HSCT 

LL6S 

.19 

20610072 

107 

3.1 


LL6SZ 

.17 

16210224 

94 

2.7 


LL1S 

0.45 

51583442 

114 

0.9 

Panel 

LL6S 

.37 

51583442 

138 

.8 


LL6SZ 

.38 

51583442 

136 

.8 


LL1S 

0.86 

77 979552 

91 

3.5 

SRB 

LL6S 

.71 

77979552 

110 

2.9 


LL6SZ 

.68 

77494 584 

105 

2.7 

NASA Ames CRAY Y-MP computer runs 


LL1S 

0.14 

20610072 

147 

2.5 

HSCT 

LL6S 

.14 

20610072 

147 

2.5 


LL6SZ 

.12 

16 210 224 

94 

2.2 


LL1S 

0.24 

51583442 

215 

0.5 

Panel 

LL6S 

.26 

51 583442 

198 

.6 


LL6SZ 

.25 

51583442 

206 

.5 


LL1S 

0.50 

77979552 

156 

2.3 

SRB 

LL6S 

.50 

77979552 

156 

2.3 


LL6SZ 

.49 

71494584 

146 

2.2 


°LL1S: forward-backward solution, no loop unrolling. 

LL6S: forward-backward solution, loop-unrolling level 6 
LL6SZ: forward-backward solution, loop- unrolling level 6 with zero checking. 
^Percentage calculation uses CPU time for VECTOR method. 


implementations of the PVBAND method have been 
presented comparing the use of macrotasking, auto- 
tasking, and microtasking. The differences between 
implementing the PVBAND method with a fixed 
number of tasks (i.e., as in using FORCE) and im- 
plementing the PVBAND method using processors 
as they become available at the time of program ex- 
ecution were discussed. 

Three structural analysis example problems were 
used to compare the parallel implementations of the 
PVBAND Choleski method. The example prob- 
lems were all generated using the COMET finite 
element software system and are representative of 
a wide class of large-scale problems for which the 
variable-band Choleski method is very efficient. The 
PVBAND implementations that use autotasking and 
microtasking were shown to be more effective on 
the CRAY multiprocessor supercomputers than the 
FORCE implementations that used macrotasking. 
Dedicated single-user runs on both the CRAY-2 
and CRAY Y-MP computers demonstrated maxi- 
mum computation rates of over 250 MFLOPS per 


processor and maximum rates of over 1 GIGAFLOP 
on the CRAY-2 using four processors and over 
2 GIGAFLOPS on the CRAY Y-MP using eight pro- 
cessors. Parallel speedups for the PVBAND imple- 
mentations compared favorably with maximum the- 
oretical speedups in all cases except for the FORCE 
implementations on the CRAY-2. Though com- 
putation rates and parallel speedups were slightly 
higher on the CRAY Y-MP than on the CRAY-2, 
the VBAND and PVBAND implementations are ef- 
ficient on both computers. 

The results presented for both dedicated and mul- 
tiuser environments indicate that parallel process- 
ing is effective in both cases in reducing wall clock 
time. The cost of parallel processing runs will al- 
ways be higher if based on total CPU (central pro- 
cessing unit) time, as indicated by increased CPU 
times for the PVBAND implementations compared 
with the VBAND implementations. The comparison 
of timing results for autotasking and microtasking 
implementations with the FORCE implementations 
demonstrates the importance of dynamically using 
available processors in a multiuser environment. 
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Appendix 

VBAND and PVBAND FORTRAN 
Listings 

In this appendix, FORTRAN listings of several 
subroutines used to implement the VBAND and 
PVBAND methods are given and briefly described. 
Figures A1 and A2 contain the listings for two sub- 
routines that are used to convert a symmetric, sparse 
matrix data structure to the variable-band matrix 
data structure used for the VBAND and PVBAND 
Choleski methods. Figures A3 and A 4 contain list- 
ings for the autotasked and FORCE implementa- 
tions of the PVBAND method, respectively. The 
two PVBAND implementations illustrate the two ap- 
proaches described in section 3 under the subsection 
“CRAY- 2 local memory considerations.” Figure A5 
contains a listing of the forward-and-backward solu- 
tions used in this study. Each figure includes line 
numbers with comments to explain major sections 
and key variables within each subroutine. A brief 
description of each subroutine follows. 

Subroutine sp2vb> shown in figure Al, accepts as 
input the sparse matrix pointer arrays and computes 
new pointer arrays for the variable-band data struc- 
ture. The sparse matrix pointer arrays contain the 
number of nonzero coefficients in each column of the 
lower triangular matrix as well as an integer row in- 
dex for each nonzero coefficient. The variable-band 
pointer arrays contain the lengths of each column in 
the lower triangular part of the variable-band ma- 
trix and the starting position of each column in a 
single-dimensioned array. In addition, the row in- 
dex values from the sparse data structure are con- 
verted to offsets that give the location of each nonzero 
coefficient from the input sparse matrix within the 
variable-band matrix array. The column lengths for 
the variable-band matrix data structure are initially 
computed in lines 56-61 in figure Al. The initial 
lengths are increased where necessary in lines 67 69 
to account for possible additional fill-in during factor- 
ization. A second adjustment is made to the column 
lengths in lines 77-92 to allow for loop unrolling. Af- 
ter adjustments to the column lengths, the starting 
position of each column within a single-dimensioned 
array is determined in lines 103 105. Finally, the 
row index pointers for each nonzero coefficient, stored 
in array m(), are changed to location pointers in 
lines 113-118. These location pointers give the po- 
sition of each nonzero coefficient within the single- 
dimensioned variable- band matrix array. Subrou- 
tine sp2vb also determines the amount of memory 
required for the variable-band matrix array and re- 


turns that value along with the maximum bandwidth 
and average bandwidth as arguments. 

Subroutine convert , shown in figure A2, is called 
after subroutine sp2vb to build the single-dimensioned 
array used to store the variable-band matrix. The 
array is initially filled with zeros in lines 29 and 30 
in figure A2. Then, in lines 34-41, the nonzero co- 
efficients from the sparse matrix storage array are 
assigned to the variable-band matrix array as deter- 
mined by the locations stored in array ia(). Though 
initially many zeros are stored in the variable-band 
array, many are changed to nonzero values during 
matrix factorization. The use of two subroutines al- 
lows for efficient dynamic memory allocation of the 
variable-band matrix array after the exact length of 
the array is determined by subroutine sp2vb. 

Subroutine pvband , shown in figure A3, is the au- 
totasked version of method PVBAND used to ob- 
tain the results presented in this paper. The sub- 
routine uses loop unrolling, local memory (on the 
CRAY-2), and zero checking as described in section 3 
along with autotasking compiler directives to imple- 
ment a fast, efficient parallel /vector version of the 
PVBAND method. The FORTRAN listing of the se- 
quential vectorized version of the VBAND method is 
not given in this appendix, but it is easily obtained 
from the listing in figure A3 by deleting both the 
parallel directives and the sections of code used by 
additional processes to copy completed columns into 
local memory. For example, lines 53-61 copy a com- 
pleted column of the variable-band matrix into local 
memory and are not required for the sequential FOR- 
TRAN version. Variable idone is used to facilitate 
the copying of columns into local memory in the au- 
totasking implementation and is not required in the 
sequential FORTRAN version. In figure A3, lines 33 
244 correspond to the barrier region in the PVBAND 
method described in section 3. Each of six columns, 
k through k + 5, are completed within an autotask- 
ing control structure ( case directive). This sequen- 
tial portion of the algorithm accounts for a very small 
percentage of the total computations. (See table 4.) 
Lines 253-284 comprise the parallel portion of the 
PVBAND method. The outer loop iterations of DO 
loop 61 (line 254) are completed in parallel, each per- 
forming a column update using the six-term SAXPY 
shown in lines 274-279. The zero-checking option 
(lines 257-271) skips the SAXPY update whenever 
all six scalar multipliers in DO loop 610 are all equal 
to zero. 

The FORCE subroutine F or cesub pvband, shown 
in figure A4, illustrates the use of the FORCE lan- 
guage syntax to implement the PVBAND method. 
The variable NPROC in line 1 of figure A4 is the 
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fixed number of tasks, determined at run time, that 
executes the parallel subroutine. Variable ME, also 
in line 1, is the task identification variable and will 
have a unique value between 1 and NPROC. In 
the FORCE variable declaration portion ending with 
line 13, all variables used within subroutine pvband 
are declared private and all argument variables are 
declared as shared or private in the calling program. 
The sequential portion of the PVBAND method in 
lines 40 159 is computed within a single FORCE bar- 
rier. The six columns computed within the barrier 
region by only one task are copied into the local mem- 
ory of all remaining tasks in lines 162-179. The par- 
allel update portion in lines 183-222 is implemented 


by the FORCE Presetted DO statement in line 190. 
The FORCE subroutine in figure A4 will run as is on 
any parallel computer for which FORCE is installed. 

Figure A5 contains the FORTRAN listing of the 
forward and backward solutions used for the VBAND 
and PVBAND methods. The subroutine in figure A5 
uses loop unrolling and zero checking. The main 
computation in the forward solution (lines 17-130) 
is the six-term SAXPY loop (lines 76-84). The 
main computations in the backward solution (lines 
132-253) are the six inner product computations 
(lines 217-224). The remaining computations are 
sequential computations. 
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1 : subrout ine sp2vb (spptr , splen , ia , n , nroll , vbptr , vblen , 

2: + ibw,iavbw,ialth) 

3: 

4: integer spptr (*) , splen (*) , iaO) , row, colm 

5: integer vbptr (*) , vblen (*) 

6: 

7: c *** This routine is called to create pointer arrays which are 
8: c *** subsequently used by routine convert to convert an input 
9: c *** matrix stored in a sparse data structure to a variable-band 
10: c *** matrix data structure. Major steps are: 1) Create arrays vblen and 
11: c *** vbptr, 2) Change row index values stored for each lower triangular 
12: c *** coefficient to location pointers for each coefficient. 

13: c *** 

14: c *** INPUT ARGUMENTS : 

15: c *** spptr () - starting position for each column of sparse 

16: c *** matrix within a single-dimensioned array 

17: c *** splenO - number of nonzero coefficients in each column of 

18: c *** sparse matrix 

19 : c *** ia() - row index for each nonzero below the main diagonal 
20: c *** in input sparse matrix. The row index values are 

21 : c *** arranged in the same order as the nonzero 

22: c *** coefficients; by columns. 

23: c *** n - number of equations in the sparse matrix 

24: c *** nroll - level of loop unrolling used in the factorization 

25: c *** routine (nroll > 1 requires adjustments to column 

26: c *** lengths) 

27: c *** 

28: c *** OUTPUT ARGUMENTS : 

29: c *** ia() - return values in this array contain an offset for 
30: c *** each nonzero coefficient in sparse matrix which 

31 : c *** will be used to expand the sparse storage into 

32; c *** variable-band storage. 

33: c *** vbbtrQ - starting position for each column of variable-band 

34: c *** matrix within a single-dimensioned array 

35: c *** vblen() - length of each column of variable-band matrix; 

36: c *** includes diagonal and allows for zeros down to the 

37 : c *** last nonzero coefficient in each column plus 

38: c *** possible adjustments for fill past the last nonzero 

39: c *** and loop-unrolling adjustments. 

40: c *** ibw - maximum semibandwidth (includes diagonal) after 
41: c *** all adjustments. 

42: c *** iavbw - average semibandwidth (total storage required 

43: c *** divided by number of equations) 

44: c *** ialth - total memory required for variable-band matrix 

45: c *** coefficient array, includes diagonal 

46: 

Figure Al. FORTRAN listing for subroutine sp2bv. 
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47: 

c 

*** 

48: 

c 

*** 

49: 

c 

*** 

50: 

c 

*** 

51: 

c 

*** 

52: 

c 

*** 

53: 

c 

*** 

54: 

c 

*** 

55: 



56: 



57: 



58: 



59: 



60: 



61: 

1 

62: 



63: 

c 

*** 

64: 

c 

*** 

65: 

c 

*** 

66: 



67: 



68: 



69: 

2 


70: 



71: 

c 

*** 

72: 

c 

*** 

73: 

c 

* * * 

74: 

c 

*** 

75: 

c 

*** 

76: 



77: 



78: 



79: 



80: 



81: 



82: 



83: 

30 

84: 



85: 

c 

*** 

86: 

c 

*** 

87: 



88: 



89: 



90: 



91: 

31 

92: 




Compute the lengths of each variable-band column 
from the main diagonal to the last nonzero coefficient 
in the lower triangular part of the matrix. The main 
diagonal is part of the column length. 

Loop 1 assumes that nonzero coefficients within each 
column of sparse input data structure are ordered by increasing 
row indexes. Pointer irx points to the last nonzero row index 
in each column. 

vblen(n)=l 
do 1 colm=l,n-l 

irx=spptr ( colm+ 1 ) - 1 
row=ia(irx) 

vblen ( colm) =row-colm+l 
continue 

Adjust column lengths for possible fill during factorization; 
each column must extend at least to the last row 
of the previous column. 

do 2 colm=2,n 

vblen (colm) =max (vblen (colm- 1) - 1 , vblen (colm) ) 
continue 

Adjust for loop unrolling if nroll > 1 ; 
groups of nroll columns must end in the same row. 

Zeros are effectively added to the ends of columns 
as required to enforce this condition by increasing 
the length of the column (vblen(colm) ) . 

if (nroll. gt.l) then 
last c=n-mod (n , nroll) 
do 30 colm=nroll , lastc , nroll 
lth=vblen(colm) 
do 30 i=l,nroll-l 
vblen ( colm- i)=lth+i 
continue 

Fix up the end columns whenever n mod nroll <> 0 
The remainder columns are full . 


iadd=2 

do 31 row=n-l,lastc+l,-l 
vblen(row)=iadd 
iadd=iadd+l 
end if 


Figure Al. Continued. 
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93: 

94: 

95: 

96: 

97: 

98: 

99: 

100 : 

101 : 

102 : 

103: 

104: 

105: 

106: 

107: 

108: 

109: 

110 : 

111 : 

112 : 

113: 

114: 

115: 

116: 

117: 

118: 

119: 

120 : 

121 : 


c *** Add up the column lengths and form vbptr array; 
c *** also compute maximum semibandwidth (including diagonal). 

ialth=0 

ibw=0 

do 40 colm=l,n 

ibw=max(ibw,vblen(colm) ) 

40 ialth=ialth+vblen(colm) 

vbptr (1)=1 
do 41 colm=2,n 

41 vbptr (colm) =vbptr (colm- l)+vblen (colm- 1) 
iavbw=ialth/n 

c *** Change the row index values from 

c *** the sparse column storage in ia() to the 

c *** locations where each nonzero coefficient is stored 

c *** within the variable-band matrix single-dimensioned array. 

do 51 colm=l,n-l 

do 51 k=spptr (colm) ,spptr (colm+l)-l 
row = ia(k) 

loc * vbptr (colm) + row - colm 
ia(k) = loc 
51 continue 

return 

end 


Figure Al. Concluded. 


1 : 

2 : 

3: 


4: 



5: 



6: 

c 

*** 

7: 

c 

*** 

8: 

c 

*** 

9: 

c 

*** 

10: 

c 

*** 

11: 

c 

*** 

12: 

c 

*** 

13: 

c 

*** 

14: 

c 

*** 

15: 

c 

*** 

16: 

c 

*** 

17: 

c 

*** 

18: 

c 

*** 

19: 

c 


20: 

c 

*** 

21: 

c 

*** 

22: 

c 

*** 

23: 

c 

*** 

24: 

c 

*** 

JO 

cn 

c 

*** 

26: 

c 

*** 

27: 



28: 

c 

*** 

29: 



30: 

10 

31: 



32: 

c 

*** 

33: 

c 

*** 

34: 



35: 

20 

36: 



37: 

c 

*** 

38: 

c 

*** 

39: 



40: 



41: 

30 

42: 



43: 



44: 




subroutine convert (coefs ,diag, ia, vbptr ,ncoef ,ialth,n, a) 

integer ia(*) ,row,colm 

integer vbptr (*) 

real a(*) , coef s(*) ,diag(*) 

This routine converts sparse data structure into variable-band 
data structure. Routine sp2vb must be called first to 
create input array vbptr () and modify array ia(*) . 

INPUT ARGUMENTS: 

coefs () - single-dimensioned array containing nonzero 

coefficients of sparse matrix, arranged by columns, 
lower triangular coefficients only 
diag() - main diagonal of sparse input matrix 

iaO - for each coefficient in array coefsO contains an offset 

into variable-band matrix array, a() 
vbptr () - starting position of each column in the variable-band 
matrix coefficient array a() . 
ncoef - number of nonzero in array coefsO 

ialth - length of the variable-band array 

n - number of equations in sparse and variable-band matrices 

OUTPUT ARGUMENTS: 

a() - single-dimensioned array containing the coefficients 

of the variable-band output matrix including zeros 
added within columns as appropriate. 

Zero out space for vband storage of coefficients, 
do 10 i=l, ialth 
a(i)=0 . 0 

Assign nonzero coefficients to locations in variable-band 
matrix array as designated by array ia() . 
do 20 i*l, ncoef 
a(ia(i))=coefs(i) 

Place main diagonal coefficients at beginning of each column 
in variable-band array, 
do 30 colm=l,n 

a (vbptr (colm) )=diag(colm) 
continue 

return 

end 


Figure A2. FORTRAN listing for subroutine convert. 
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1 

2 

3 

4 

5 

6 

7 

8 
9 

10 

11 

12 

13 

14 

15 

16 

17 

18 

19 

20 
21 
22 

23 

24 

25 

26 

27 

28 

29 

30 

31 

32 

33 

34 

35 

36 

37 

38 

39 

40 

41 

42 

43 

44 


subroutine pvband (a, alth, colptr , eolith, n,nopsf) 

integer k,i, j ,lastr ,lth,tl ,iadd 

integer ist 1 , ist2 , ist3 , ist4 , ist5 , ist6 

integer il , i2 , i3, i4, i5 , i6 

integer n,alth,nopsf 

integer colptr (n) , collth(n) 

real a(alth) 

parameter (lmsize=9600) 
parameter (16=lmsize/6) 

real ml(16) ,m2(16) ,m3(16) ,m4(16) ,m5(16) ,m6(16) 
common /lmem/ idone ,ml ,m2 ,m3,m4,m5 ,m6 

cdir$ regfile lmem 

cmic$ parallel sharedCa, alth, colptr , eolith, n,nopsf) private (k, i ,j , 
cmic$l lastr ,lth,tl , iadd,istl ,ist2,ist3,ist4,ist5,ist6, 
cmic$2 il,i2,i3,i4,i5,i6, idone, ml ,m2,m3,m4,m5,m6,ict ,itst) 

c *** Variable idone is a private variable and is used to 
c *** determine which processors must copy multipliers into 
c *** local memory. The processor which computes a multiplier 
c *** already has that column in local memory and proceeds 
c *** to the next code section, skipping the copy segment. 
idone=0 
k=l 

c *** Counting variable nopsf is shared and must be zeroed prior 

c *** to calling this routine. The variable is used to count 

c *** only serial operations. 

999 continue 

c *** Begin jki portion: Complete 6 columns of L; k through k+5 . 
c *** The 5 columns are used as multiplier columns in the 
c *** update section. 

c *** Finish the kth multiplier. 

istl=colptr (k) 
il=istl+l 

cmic$ case 

a(istl) = l . 0/sqrt (a(istl) ) 
tl = l 
cdir$ ivdep 

Figure A3. FORTRAN listing for autotasked subroutine pvband . 
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45: do 10 i=il , il+collth(k)-2 

46: a(i)=a(i)*a(istl) 

47: ml(tl)=a(i) 

48: tl-tl+1 

49: 10 continue 

50: idone=k 

51: cmic$ end case 
52: 

53: c *** Everyone else copy the kth column into local memory. 
54: if (idone.lt.k) then 

55: tl-1 

56: do 111 i=il , il+collth(k)-2 

57: ml(tl)-a(i) 

58: tl-tl+1 

59: 111 continue 

60: idone=k 

61: end if 

62: 

63: c *** Update, then finish column k+1. 

64 : ist2-colptr (k+1) 

65: i2=ist2+l 

66 : 

67: cmic$ case 
68: tl-1 

69: ict-0 

70: cdir$ ivdep 

71: do 11 i=ist2,ist2+collth(k+l)-l 

72: a(i)=a(i) -miCtl) *ml (tl+ict) 

73: ict=ict+l 

74: 11 continue 

75: 

76: c *** Finish the (k+l)st multiplier. 

77 : a(ist2)=l .0/sqrt (a(ist2) ) 

78: tl=2 

79: cdir$ ivdep 

80: do 20 i=i2 , i2+collth(k+l)-2 

81 : a(i)=a(i)*a(ist2) 

82: m2(tl)=a(i) 

83: tl-tl+1 

84: 20 continue 

85: idone=k+l 

86: cmic$ end case 


Figure A3. Continued. 
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87 

88 

89 

90 

91 

92 

93 

94 

95 

96 

97 

98 

99 
100 
101 
102 

103 

104 

105 

106 

107 

108 

109 

110 
111 
112 

113 

114 

115 

116 

117 

118 

119 

120 
121 
122 

123 

124 

125 

126 

127 

128 

129 

130 

131 

132 

133 


c *** Everyone else copy column k+1 into local memory, 
if (idone . It .k+1) then 
tl=2 

do 211 i=i2 , i2+collth(k+l) -2 
m2(tl) =a(i) 
tl=tl+l 
211 continue 
idone=k+l 
end if 

c *** Update, then finish column k+2. 
ist3=colptr (k+2) 
i3=ist3+l 

cmic$ case 
1 1=2 
ict=0 
cdir$ ivdep 

do 21 i=ist3 , ist3+collth(k+2)-l 

a(i)=a(i)-ml (tl) *ml (tl+ict) -m2(tl) *m2(tl+ict) 
ict=ict+l 
21 continue 

c *** Finish the (k+2)nd multiplier. 
a(ist3)=l . 0/sqrt (a(ist3) ) 
tl=3 
cdir$ ivdep 

do 30 i=i3, i3+collth(k+2) -2 
a(i)=a(i)*a(ist3) 
m3(tl)=a(i) 
tl=tl+l 
30 continue 

idone=k+2 
cmic$ end case 

c *** Everyone else copy the column k+2 into local memory, 
if (idone . It .k+2) then 
1 1=3 

do 311 i=i3 , i3+collth(k+2) -2 
m3(tl)=a(i) 
tl=tl+l 
311 continue 
idone=k+2 
endif 

c *** Update, then finish column k+3. 


Figure A3. Continued. 
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135 

136 

137 

138 

139 

140 

141 

142 

143 

144 

145 

146 

147 

148 

149 

150 

151 

152 

153 

154 

155 

156 

157 

158 

159 

160 
161 
162 

163 

164 

165 

166 

167 

168 

169 

170 

171 

172 

173 

174 

175 

176 

177 

178 

179 

180 
181 


ist4=colptr (k+3) 
i4=ist4+l 

cmic$ case 
1 1=3 
ict=0 
cdir$ ivdep 

do 31 i=ist4,ist4+collth(k+3)-l 

a(i)=a(i)-ml (tl)*ml (t 1+ict ) -m2 (t 1 )*m2 (tl+ict) -m3 (tl)*m3 (tl+ict) 
ict-ict+1 
31 continue 

c *** Finish the (k+3)rd multiplier. 
a(ist4)=l . 0/sqrt (a(ist4) ) 

1 1=4 
cdir$ ivdep 

do 40 i=i4,i4+collth(k+3)-2 
a(i)=a(i)*a(ist4) 
m4(tl)=a(i) 
tl=tl+l 

40 continue 
idone=k+3 

cmic$ end case 

c *** Everyone else copy the column k+3 into local memory, 
if (idone.lt .k+3) then 
tl=4 

do 411 i=i4, i4+collth(k+3)-2 
m4(tl)=a(i) 
tl=tl+l 
411 continue 
idone=k+3 
endif 

c *** Update, then finish column k+4. 
ist5=colptr (k+4) 
i5=ist5+l 

cmic$ case 
tl=4 
ict=0 
cdir$ ivdep 

do 41 i=ist5,ist5+collth(k+4)-l 

a(i)=a(i)-ml (tl)*ml (tl+ict) -m2(tl) *m2 (t 1+ict ) -m3 (tl)*m3 (tl+ict) 
+ -m4(tl)*m4 (tl+ict) 

ict=ict+l 

41 continue 


Figure A3. Continued. 
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182: 

c *** 

Finish the (k+4)th multiplier. 

183: 


a(ist5)=l . 0/sqrt (a(ist5) ) 

184: 


1 1=5 

185: 

cdir$ ivdep 

186: 


do 50 i=i5,i5+collth(k+4)-2 

187: 


a(i)=a(i) *a(ist5) 

188: 


m5(t l)=a(i) 

189: 


tl-ti+1 

190: 

50 

continue 

191: 


idone=k+4 

192: 

cmic$ 

end case 

193: 



194: 

c *** 

Everyone else copy the column k+4 into local memory. 

195: 


if (idone . It .k+4) then 

196: 


tl=5 

197: 


do 511 i=i5 , i5+collth(k+4)-2 

198: 


m5(tl)=a(i) 

199: 


tl=tl+l 

200: 

511 

continue 

201: 


idone=k+4 

202: 


endif 

203: 



204: 

c *** 

Update, then finish column k+5. 

205: 


ist6=colptr (k+5) 

206: 


i6=ist6 

207: 



208: 

cmic$ 

case 

209: 


tl=5 

210: 


ict=0 

211: 

cdir$ ivdep 

212: 


do 51 i=ist6,ist6+collth(k+5)-l 

213: 


a(i)=a(i)-ml (tl)*ml (tl+ict)-m2(tl) *m2(tl+ict)-m3(tl)*m3(tl+ict) 

214: 

+ 

-m4(tl) *m4(tl+ict) -m5(tl) *m5(tl+ict) 

215: 


ict=ict+l 

216: 

51 

continue 

217: 



218: 

c *** 

Finish the (k+5)th multiplier. 

219: 


a(ist6)=l. 0/sqrt (a(ist6)) 

220: 


tl=6 

221: 

cdir$ ivdep 

222: 


do 60 i=ist6+l,ist6+collth(k+5)-l 

223: 


a(i)=a(i) *a(ist6) 

224: 


m6(tl)=a(i) 

225: 


tl=tl+i 

226: 

60 

continue 


Figure A3. Continued. 
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227: 

228: 

229: 

230: 

231: 

232: 
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234: 
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241: 

242: 

243: 

244: 
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246: 

247: 
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249: 

250: 

251: 

252: 

253: 

254: 

255: 

256: 

257: 

258: 

259: 

260: 

261: 

262: 

263: 

264: 

265: 

266: 

267: 

268: 

269: 

270: 

271: 

272: 


c *** Count all floating point vector operations for finishing 
c *** this group of 6 multiplier columns. 

nopsf =nopsf+collth(k+5) *36 + 55 
cmic$ end case 

c *** Everyone else copy column k+5 into local memory, 
if (idone . It .k+5) then 
1 1=6 

do 6111 i=ist6+l , ist6+collth(k+5)-l 
m6(tl)=a(i) 
tl=tl+l 
6111 continue 
idone=k+5 
endif 

c *** End jki portion: 6 columns of L completed. 

c *** Begin kji portion: update columns k+6 thru lastr 
c *** 0 f matrix using the 6 completed multiplier columns. 

c **+ Variable lastr corresponds to the last row in column k 
c *** that is stored in the variable-band format. 
lastr=k+collth(k) -1 

cmic$ do parallel 

do 61 j=k+6, lastr 
tl-j-k 
itst=tl 

c *** Begin full zero checking 
if (ml (itst) . eq. 0 . 0) then 
if (m2(itst) .eq.0.0) then 
if (m3(itst) . eq. 0 . 0) then 
if (m4(itst) . eq. 0 .0) then 
if (m5(itst) . eq. 0 . 0) then 
if (m6(itst) .eq.0.0) then 
goto 611 
endif 
endif 
endif 
endif 
endif 
endif 

c *** End zero checking. 
lth=lastr-j 


Figure A3. Continued. 
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cdir$ ivdep 

do 610 i=colptr(j) ,colptr( j)+lth 

a(i)=a(i) -ml (itst) *ml (tl) -m2(itst)*m2(tl) 

+ -m3(itst)*m3(tl)-m4(itst) *m4(tl) 

+ -mS(itst) *m5(tl)-m6(itst) *m6(tl) 

tl=tl+l 

610 continue 

611 continue 
61 continue 

c *** End kji portion: increment k and begin next jki portion 
c *** if k is less than n-6. 

k=k+6 

if (k. It. n-6) goto 999 

c *** Finish any remaining columns of L. There are up to 7 
c *** columns remaining. The last 7 columns are explicitly 
c *** computed without loop overhead, etc. 

cmic$ case 

goto (100,110,120,130,140,150,160) , (n-k+1) 

160 continue 

c *** 6 columns left to update 

ist l=colptr (k) 
a(istl)=l .0/sqrt (a(istl) ) 
a(istl+l)=a(ist 1+1) *a(istl) 
a ( ist 1+2) =a( ist 1+2) *a(istl) 
a(istl+3)=a(ist 1+3) *a(istl) 
a ( ist 1+4) =a( ist 1+4) *a(istl) 
a (ist 1+5) =a( ist 1+5) *a(isti) 
a ( ist 1+6) =a (ist 1+6) *a(istl) 
ist2=colptr (k+1) 

a ( ist 2) =a( ist 2) -a (ist 1+1) *a(istl+l) 
a(ist2+l)=a(ist2+l) -a (ist l+l)*a(istl+2) 
a(ist2+2)=a(ist2+2)-a(istl+l)*a(istl+3) 
a ( ist 2+3) =a ( ist 2+3) -a (ist 1+1 )*a (ist 1+4) 
a(ist2+4)=a(ist2+4)-a(istl+l)*a(istl+5) 
a ( ist 2+5) =a (ist 2+5) -a(istl+l)*a(istl+6) 
ist3=colptr (k+2) 

a(ist3)=a(ist3)-a(istl+2) *a (ist 1+2) 
a(ist3+l)=a(ist3+l)-a(istl+2)*a(istl+3) 
a(ist3+2)=a(ist3+2)-a(istl+2)*a(istl+4) 
a(ist3+3)=a(ist3+3)-a(istl+2)*a(istl+5) 
a(ist3+4)=a(ist3+4)-a(istl+2) *a (ist 1+6) 


Figure A3. Continued. 
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ist4=colptr (k+3) 

a(ist4)=a(ist4) -a (1st 1+3) +a(istl+3) 
a(ist4+l)=a(ist4+l)-a(istl+3)*a(istl+4) 
a(ist4+2)=a(ist4+2)-a(isti+3)*a(istl+5) 
a(ist4+3)=a(ist4+3)-a(istl+3)*a(istl+6) 
ist5=colptr (k+4) 

a(ist5)=a(ist5)-a(istl+4)*a(istl+4) 
a(ist5+l)=a(ist5+l)-a(istl+4) *a(istl+5) 
a(ist5+2)=a(ist5+2)-a(istl+4) *a(istl+6) 
ist6=colptr (k+5) 

a(ist6)=a(ist6)-a(ist 1+5) *a(istl+5) 
a(ist6+l)=a(ist6+l)-a(istl+5) *a(istl+6) 
a(colptr (k+6))=a(colptr(k+6))-a(istl+6) *a(istl+6) 
k=k+l 

nopsf=nopsf +49 
150 continue 

c *** 5 columns left to update 

istl=colptr (k) 
a(istl)=l. 0/sqrt (a(istl) ) 
a(istl+l)=a(istl+l)*a(istl) 
a(ist l+2)=a(istl+2) *a(istl) 
a(ist l+3)=a(istl+3) *a(istl) 
a(ist l+4)=a(istl+4) *a(istl) 
a(istl+5)=a(istl+5) *a(istl) 
ist2=colptr (k+1) 

a(ist2)=a(ist2)-a(istl+l) *a(istl+l) 
a(ist2+l)=a(ist2+l)-a(istl+l) *a(istl+2) 
a(ist2+2)=a(ist2+2)-a(istl+l) *a(istl+3) 
a(ist2+3)=a(ist2+3)-a(istl+l) *a(istl+4) 
a(ist2+4)=a(ist2+4)-a(istl+l) *a(istl+5) 
ist3=colptr (k+2) 

a(ist3)=a(ist3) -a(ist 1+2) *a(istl+2) 
a(ist3+l) : =a(ist3+l)-a(istl+2) *a(istl+3) 
a(ist3+2)=a(ist3+2)-a(istl+2) *a(istl+4) 
a(ist3+3)=a(ist3+3)-a(istl+2)*a(istl+5) 
ist4=colptr (k+3) 

a(ist4)=a(ist4) -a(istl+3) *a (1st 1+3) 
a(ist4+l)=a(ist4+l) -a(ist 1+3) *a(istl+4) 
a(ist4+2)=a(ist4+2)-a(istl+3) *a(istl+5) 
ist5=colptr(k+4) 

a(ist5)=a(ist5)-a(istl+4)*a(istl+4) 
a(ist5+l)=a(ist5+l)-a(istl+4) *a(istl+5) 
a(colptr (k+5) )=a(colptr (k+5) ) -a(istl+5)*a(istl+5) 
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k=k+l 

nopsf =nopsf+36 
140 continue 

c *** 4 columns left to update 

istl=colptr (k) 
a(istl)=l . 0 /sqrt (a(istl) ) 
a(istl+l)=a(istl+l)*a(istl) 
a(istl+ 2 )=a(istl+ 2 ) *a(istl) 
a(istl+3)=a(istl+3) *a(istl) 
a(istl+4)=a(istl+4) *a(istl) 
ist 2 =colptr (k+ 1 ) 

a(ist 2 )=a(ist 2 )-a(istl+l)*a(istl+l) 
a(ist 2 +l)=a(ist 2 +l)-a(istl+l) *a(istl+ 2 ) 
a(ist 2 + 2 )=a(ist 2 + 2 )-a(istl+l) *a(istl+3) 
a(ist2+3)=a(ist2+3)-a(istl+l) *a(istl+4) 
ist3=colptr (k+2) 

a(ist3)=a(ist3)-a(istl+2) *a Cist 1+2) 
a(ist3+l)=a(ist3+l)-a(istl+2) *a(istl+3) 
a(ist3+2)=a(ist3+2)-a(istl+2) *a(istl+4) 
ist4=colptr (k+3) 

a(ist4)=a(ist4) -a(istl+3) *a(istl+3) 
a(ist4+l)=a(ist4+l) -a(istl+3) *a(istl+4) 
a(colptr (k+4) )=a(colptr (k+4) ) -a(ist 1+4) *a(istl+4) 
k=k+l 

nopsf =nopsf +25 
130 continue 

c *** 3 columns left to update 

istl=colptr (k) 
a(istl)=l. 0 /sqrt (a (istl)) 
a(ist 1 + 1 ) =a(ist 1 + 1 ) *a Cist 1 ) 
a(ist 1 + 2 ) =a(ist 1 + 2 ) *a (istl) 
a(istl+3)=a(istl+3)*a(istl) 
ist 2 =colptr (k+ 1 ) 

a(ist 2 )=a(ist 2 )-a(istl+l) *a(istl+l) 
a(ist 2 +l)=a(ist 2 +l)“a(istl+l)*a(istl+ 2 ) 
a(ist 2 + 2 )=a(ist 2 + 2 )-a(istl+l) *a(istl+3) 
ist3=colptr (k+2) 

a(ist 3 )=a(ist 3 )-a(istl+ 2 ) *a(istl+ 2 ) 
a(ist3+l)=a(ist3+l)-a(istl+2)*a(istl+3) 
aCcolptr (k+3) )=a(colptr (k+3) ) -a(istl+3) *a(ist 1+3) 
k=k+l 

nopsf =nopsf +16 
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120 continue 

c *** 2 columns left to update 

istl=colptr (k) 
a(istl)=l . 0/sqrt (a(istl)) 
a(istl+l)=a(istl+l)*a(istl) 
a(istl+2)=a(istl+2) *a(ist 1) 
ist2=colptr (k+1) 

a(ist2)=a(ist2) -a(istl+l) *a(istl+l) 
a(ist2+l)-a(ist2+l)-a(istl+l)*a(istl+2) 
a(colptr (k+2) ) =a(colptr (k+2) ) -a(ist 1+2) *a(istl+2) 
k=k+l 

nopsf =nopsf+9 
110 continue 

c *** l column left to update 
istl=colptr Ck) 
a (istl)=l .0/sqrt (a(istl)) 
a(istl+l)=a(istl+l)*a(istl) 
ist2=colptr (k+1) 

a(ist2) =a(ist2) -a(ist 1+1) *a(ist 1+1) 
nopsf=nopsf+4 

100 continue 

c *** 0 columns left to update 

a(colptr (n) ) =1 . 0/sqrt (a(colptr (n) ) ) 
nopsf =nopsf+l 
cmic$ end case 
cmic$ end parallel 

return 

end 


Figure A3. Concluded. 


44 


1: 

2 : 

3: 

4: 

5: 

6 : 

7: 

8 : 

9: 

10 : 

11 : 

12 : 

13: 

14: 

15: 

16: 

17: 

18: 

19: 

20: 

21 : 

22: 

23: 

24: 

25: 

26: 

27: 

28: 

29: 

30: 

31: 

32: 

33: 

34: 

35: 

36: 

37: 

38: 

39: 

40: 

41: 

42: 

43: 

44: 

45: 

46: 

47: 


Forcesub pvband (a, alth, colptr , eolith, n, nopsf ) of NPROC ident ME 

Private integer k,i, j ,lastr , lth,tl , iadd 

Private integer istl ,ist2,ist3,ist4,ist5,ist6 

Private integer il,i2,i3, i4, i5,i6, ict ,itst 

integer n,alth,nopsf 

integer colptr (n) , eolith (n) 

real a(alth) 

parameter (lmsize=9600) 
parameter (16=lmsize/6) 

Private real ml (16) ,m2(16) ,m3(16) ,m4(16) ,m5(16) ,m6(16) 
common /lmem/ idone ,ml ,m2 ,m3 ,m4 ,m5,m6 
End declarations 

cdir$ regfile lmem 


c *** Variable idone is a private variable and is used to 
c *** determine which processors must copy multipliers into 
c *** local memory. The processor which computes a multiplier 
c *** already has that column in local memory and proceeds 
c *** to the next code section, skipping the copy segment. 
idone=0 
k=l 

nopsf =0 

c *** Counting variable nopsf is private and is used to count 
c *** the number of operations used by each process. 

999 continue 

c *** Begin jki portion: Complete 6 columns of L; k through k+5. 
c *** The 6 columns are used as multiplier columns in the 
c *** update section. 

c *** Finish the kth multiplier. 
istl=colptr (k) 
il=ist 1+1 

Barrier 

a(istl)=l . 0/sqrt (a (istl) ) 
tl=l 
cdir$ ivdep 

do 10 i=il , il+collth(k)-2 
a(i)=a(i) *a(istl) 
ml (tl)=a(i) 
tl=tl+l 
10 continue 

Figure A4. FORTRAN listing for FORCE subroutine pvband. 
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c *** Update, then finish column k+1. 
ist2=colptr(k+l) 
i2=ist2+l 
tl=l 
ict=0 
cdir$ ivdep 

do 11 i=ist2 , ist2+collth(k+l)-l 
a(i)=a(i) -ml (t 1) *ml (tl+ict) 
ict=ict+l 
11 continue 

c *** Finish the (k+l)st multiplier. 
a(ist2)=l .0/sqrt (a(ist2) ) 
tl=2 
cdir$ ivdep 

do 20 i=i2,i2+collth(k+l)-2 
a(i)=a(i) *a(ist2) 
m2(tl)=a(i) 
tl=tl+l 

20 continue 

c *** Update, then finish column k+2. 
ist3=colptr(k+2) 
i3=ist3+l 
1 1=2 
ict=0 
cdir$ ivdep 

do 21 i=ist3 , ist3+collth(k+2) -1 

a(i)=a(i)-ml (tl) *ml (tl+ict) -m2 (tl) *m2 (tl+ict) 
ict=ict+l 

21 continue 

c *** Finish the (k+2)nd multiplier. 
a(ist3)=l .0/sqrt (a(ist3) ) 

1 1=3 
cdir$ ivdep 

do 30 i=i3,i3+collth(k+2)-2 
a(i)=a(i) *a(ist3) 
m3(tl)=a(i) 
tl=tl+l 
30 continue 

c *** Update, then finish column k+3. 
ist4=colptr (k+3) 
i4=ist4+i 
1 1=3 
ict=0 


Figure A4. Continued. 
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cdir$ ivdep 

do 31 i=ist4,ist4+collth(k+3)-l 

a(i)=a(i) -ml (tl)*ml (tl+ict)-m2(tl) *m2(tl+ict)-m3(tl)*m3(tl+ict) 
ict=ict+l 
31 continue 

c *** Finish the (k+3)rd multiplier, 
a(ist4)=l . 0/sqrt (a(ist4) ) 
tl=4 

cdir\$ ivdep 

do 40 i=i4,i4+collth(k+3)-2 
a(i)=a(i) *a(ist4) 
m4(tl)=a(i) 
tl=tl+l 

40 continue 

c *** Update, then finish column k+4. 
ist5=colptr (k+4) 
i5=ist5+l 
1 1=4 
ict=0 
cdir\$ ivdep 

do 41 i=ist5, ist5+collth(k+4)-l 

a(i)=a(i)-ml (t 1) *ml (tl+ict) -m2(tl) *m2 (tl+ict) -m3 (tl) *m3 (tl+ict) 
+ -m4(tl) *m4 (tl+ict) 

ict=ict+l 

41 continue 

c *** Finish the (k+4)th multiplier, 
a(ist5)=l . 0/sqrt (a (ist5) ) 
tl=5 

cdir\$ ivdep 

do 50 i=i5,i5+collth(k+4)-2 
a(i)=a(i) *a(ist5) 
m5(tl)=a(i) 
tl=tl+l 
50 continue 

c *** Update, then finish column k+5. 
ist6=colptr (k+5) 
i6=ist6 
1 1=5 
ict=0 
cdir\$ ivdep 

do 51 i=ist6 , ist6+collth(k+5) -1 

a(i)=a(i)-ml (t l)*ml (tl+ict) -m2(tl) *m2(tl+ict)-m3(t l)*m3(tl+ict) 


Figure A4. Continued. 
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+ -m4 (t 1 ) *m4 ( t 1+ict ) -m5 (t 1 ) *m5 (t 1+ict) 

ict=ict+l 
51 continue 

c *** Finish the (k+5)th multiplier. 
a(ist6)=l . 0/sqrt (a(ist6) ) 

1 1=6 

cdir\$ ivdep 

do 60 i=ist6+l,ist6+collth(k+5)-l 
a(i)=a(i) *a(ist6) 
m6(tl)=a(i) 
tl=tl+l 
60 continue 

idone=k+5 

c *** Count all floating point vector operations for finishing 
c *** this group of 6 multiplier columns. 
nopsf=nopsf+collth(k+5)*36 + 55 
End barrier 

c *** Everyone else copy columns k thru k+5 into local memory, 
if (idone . It . k+5) then 
il=colptr (k)+5 
i2=colptr (k+l)+4 
i3=colptr (k+2) +3 
i4=colptr (k+3) +-2 
i5=colptr (k+4) +1 
i6=colptr (k+5) 
do 6111 ict=l ,collth(k+5)-l 
ml (5+ict)=a(i 1+ict) 
m2(5+ict)=a(i2+ict) 
m3(5+ict)=a(i3+ict) 
m4(5+ict)=a(i4+ict) 
m5(5+ict)=a(i5+ict) 
m6 (5+ict)=a(i6+ict) 

6111 continue 
idone=k+5 
end if 

c *** End jki portion: 6 columns of L completed. 

c *** Begin kji portion: update columns k+6 thru lastr 
c *** of matrix using the 6 completed multiplier columns. 

c *** Variable lastr corresponds to the last row in column k 
c *** that is stored in the variable-band format. 
lastr=k+collth(k) -1 


Figure A4. Continued. 
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c *** 


189: 
190: 
191: 
192: 
193: 
194: 
195: 
196: 
197: 
198: 
199: 
200 : 
201: 
202: 
203: 
204: 
205: 
206: 
207: 
208: 
209: 
210: 
211: 
212: 
213 1- 
214: 
215: 
216: 
217: 
218: 
219: 
220 : 
221: 
222: 
223: 
224: 
225: 
226: 
227: 
228: 
229: 
230: 
231: 
232: 
233: 
234: 


Presched do 61 j=k+6,lastr 
tl=j-k 
itst=tl 

c *** Begin full zero checking 
if (ml(itst) .eq.0.0) then 
if (m2(itst) . eq.0.0) then 
if (m3(itst) .eq.0.0) then 
if (m4(itst) .eq.0.0) then 
if (m5(itst) .eq.0.0) then 
if (m6 (itst) . eq. 0 . 0) then 
goto 611 
endif 
endif 
endif 
endif 
endif 
endif 

c *** End zero checking. 

lth=lastr-j 
cdir\$ ivdep 

do 610 i=colptr(j) ,colptr (j)+lth 

a(i)=a(i) -ml (itst)*ml (tl)-m2(itst)*m2(tl) 

+ -m3 (itst) *m3(tl)-m4(itst) *m4(tl) 

+ -m5(itst)*m5(tl)~m6(itst) *m6(tl) 

tl=tl+l 
610 continue 

c *** Count floating point vector operations in kji update loop 
nopsf =nopsf +12* (lth+1) . 


611 continue 
61 End presched do 

c *** End kji portion: increment k and begin next jki portion 
c *** if k is less than n-6. 

k=k+6 

if (k. It. n-6) goto 999 

c *** Finish any remaining columns of L. There axe up to 7 
c *** columns remaining. The last 7 columns are explicitly 
c *** computed without loop overhead, etc. 

Barrier 

goto (100,110,120, 130 ,140,150, 160) , (n-k+ 1 ) 


Figure A4. Continued. 
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160 continue 

c *** 6 columns left to update 

isti=colptr (k) 
a(istl)=l . 0/sqrt (a(istl)) 
a(istl+l)=a(istl+l)*a(istl) 
a(istl+2)=a(istl+2)*a(istl) 
a(isti+3)=a(istl+3) *a(istl) 
a(istl+4)=a(istl+4) *a(istl) 
a(istl+5)=a(istl+5) *a(ist 1) 
a(istl+6)=a(istl+6) *a(istl) 
ist2=colptr (k+1) 

a(ist2)=a(ist2)-a(istl+l)*a(istl+l) 
a(ist2+l)=a(ist2+l)-a(istl+l) *a(istl+2) 
a(ist2+2)=a(ist2+2)-a(istl+l)*a(istl+3) 
a(ist2+3)=a(ist2+3)-a(istl+l) *a(istl+4) 
a(ist2+4)=a(ist2+4) -a(istl+l) *a(ist 1+5) 
a(ist2+5)=a(ist2+5)-a(istl+l) *a(ist 1+6) 
ist3=colptr (k+2) 

a(ist3)=a(ist3) -a(istl+2) *a Cist 1+2) 
a(ist3+l)=a(ist3+l)-a(istl+2) *a(istl+3) 
a(ist3+2)=a(ist3+2) -a(istl+2) *a(istl+4) 
a(ist3+3)=a(ist3+3) -a(istl+2) *a(istl+5) 
a(ist3+4)=a(ist3+4) -a(istl+2)*a(istl+6) 
ist4=colptr (k+3) 

a(ist4)=a(ist4) -a(istl+3) *a(ist 1+3) 
a(ist4+l)=a(ist4+l)-a(istl+3) *a(istl+4) 
a(ist4+2)=a(ist4+2)-a(istl+3) *a(ist 1+5) 
a(ist4+3)=a(ist4+3) -a(istl+3) *a(ist 1+6) 
ist5=colptr (k+4) 

a(ist5)=a(ist5) -a(istl+4) *a(istl+4) 
a(ist5+l)=a(ist5+l) -a(ist 1+4) *a(istl+5) 
a(ist5+2)=a(ist5+2) -a(istl+4) *a(istl+6) 
ist6=colptr Ck+5) 

a(ist6)=a(ist6)-a(istl+5) *a(istl+5) 
a(ist6+l)=a(ist6+l)-a(istl+5) *a(istl+6) 
aCcolptr (k+6) )=a(colptr (k+6) ) -a(istl+6) *a(ist 1+6) 
k=k+l 

nopsf =nopsf +49 
150 continue 

c *** 5 columns left to update 

istl=colptr (k) 
a(istl)=l .0/sqrt (a (istl) ) 
a(istl+l)=a(istl+l) *a(istl) 
a(istl+2)=a(istl+2) *a(istl) 

Figure A4. Continued. 
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a(istl+3)=a(istl+3)*a(istl) 
a(istl+4)=a(istl+4)*a(istl) 
a(istl+5)=a(istl+5)*a(istl) 
ist2=colptr (k+1) 

a(ist2)=a(ist2)-a(istl+l) *a(istl+l) 
a(ist2+l)=a(ist2+l)-a(istl+l)*a(istl+2) 
a(ist2+2)=a(ist2+2) -a(istl+l)*a(istl+3) 
a(ist2+3)=a(ist2+3) -a(istl+l)*a(istl+4) 
a(ist2+4)=a(ist2+4)-a(istl+l)*a(istl+5) 
ist3=colptr (k+2) 

a(ist3)=a(ist3)-a(istl+2) *a(istl+2) 
a(ist3+l)=a(ist3+l)-a(istl+2) *a (1st 1+3) 
a(ist3+2)=a(ist3+2) -a(istl+2) *a(istl+4) 
a(ist3+3)=a(ist3+3) -a(istl+2) *a(istl+5) 
ist4=colptr (k+3) 

a(ist4)=a(ist4)-a(istl+3)*a(istl+3) 
a(ist4+l) =a(ist4+l) -a(ist 1+3) *a(istl+4) 
a(ist4+2)=a(ist4+2) -a(ist 1+3) *a(istl+5) 
ist5=colptr (k+4) 

a(ist5)=a(ist5)-a(istl+4) *a(istl+4) 
a(ist5+l)=a(ist5+l) -a(istl+4) *a(istl+5) 
a(colptr (k+5))=a(colptr (k+5) ) -a(istl+5) *a(isti+5) 
k=k+l 

nopsf =nopsf +36 
140 continue 

c *** 4 columns left to update 

istl=colptr (k) 
a(istl)=l. 0/sqrt (a(istl)) 
a(istl+l)=a(istl+l)*a(istl) 
a(istl+2)=a(istl+2) *a(istl) 
a(istl+3)=a(istl+3)*a(istl) 
a(istl+4)=a(istl+4)*a(istl) 
ist2=colptr (k+1) 

a(ist2)=a(ist2) -a(istl+l) *a(ist 1+1) 
a(ist2+l)=a(ist2+l)-a(istl+l) *a(istl+2) 
a(ist2+2)=a(ist2+2)-a(istl+l) *a(istl+3) 
a(ist2+3)=a(ist2+3)-a(istl+l) *a(istl+4) 
ist3=colptr (k+2) 

a(ist3)=a(ist3) -a(istl+2) *a(ist 1+2) 
a(ist3+l)=a(ist3+l)-a(istl+2) *a(istl+3) 
a(ist3+2)=a(ist3+2)-a(istl+2)*a(istl+4) 
ist4=colptr (k+3) 

a(ist4)=a(ist4) -a(istl+3) *a(istl+3) 
a(ist4+l)=a(ist4+l)-a(istl+3)*a(istl+4) 
a(colptr (k+4) )=a(colptr (k+4) )-a(istl+4)*a(istl+4) 
k=k+l 

nopsf =nopsf +25 


Figure A4. Continued. 
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130 continue 

c *** 3 columns left to update 

istl=colptr (k) 
a(istl)=l . 0/sqrt (a(istl) ) 
a(istl+l)=a(istl+l)*a(istl) 
a(istl+2)=a(istl+2)*a(istl) 
a(istl+3)=a(istl+3)*a(istl) 
ist2=colptr (k+1) 

a(ist2)=a(ist2) -a(istl+l) *a(istl+l) 
a(ist2+l)=a(ist2+l) -a(istl+l) *a(istl+2) 
a(ist2+2)=a(ist2+2)-a(istl+l)*a Cist 1+3) 
istS^colptr (k+2) 

a(ist3)=a(ist3)-a(istl+2) *a(istl+2) 
a(ist3+l)=a(ist3+l)-a(istl+2) *a(ist 1+3) 
aCcolptr (k+3) )=a(colptr (k+3) ) -a(istl+3) *a(istl+3) 
k=k+l 

nopsf=nopsf +16 
120 continue 

c *** 2 columns left to update 

istl=colptr (k) 
a(istl)=l . 0/sqrt (a(istl)) 
a(isti+l)=a(istl+l)*a(istl) 
a(istl+2)=a(istl+2) *a(ist 1) 
ist2=colptr (k+1) 

a(ist2)=a(ist2)“a(istl+l)*a(istl+l) 
a(ist2+l)-a(ist2+l) -a(istl+l)*a(istl+2) 
a(colptr(k+2))=a(colptr (k+2) )“a(istl+2)*a(istl+2) 
k=k+l 

nopsf =nopsf+9 
110 continue 

c i column left to update 

istl=colptr (k) 
a(istl)=l. 0/sqrt (a(istl) ) 
a(istl+l)=a(istl+l)*a(istl) 
ist2=colptr (k+1) 

a(ist2)=a(ist2)-a(istl+l) *a(istl+l) 
nopsf=nopsf+4 

100 continue 

c *** o columns left to update 

a(colptr (n) )=1 . 0/sqrt (a(colptr (n) ) ) 

nopsf=nopsf+l 

End barrier 

return 

end 


Figure A4. Concluded. 
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subroutine vbsolveCa, alth, colptr , eolith, n,b,nopss) 

integer alth,n,nopss 

integer i , j , ict ,lastr ,left 

integer tl , t2 ,t3, t4 ,t5 ,t6 ,t7 

integer istl , ist2 , ist3, ist4 , ist5 , ist6 

integer colptr (n) , collth(n) 

real a(alth) ,b(n) , t 

c *** This routine performs f orward-and-backward substitution 
c *** 0 f triangular factored matrices. It assumes a variable-band 
c *** data structure with adjustments for loop unrolling to level 6. 
c *** NOTE: This solver assumes that the diagonal elements of 
c *** the triangular systems are the reciprocals of the actual 
c *** diagonal elements 
nopss=0 

c *** Solve Ly=b (column sweep) 

left=mod(n,6) 
last=n-left 
do 1 i=l,last,6 

istl=colptr (i) 
ist2=colptr(i+l) 
ist3=colptr (i+2) 
ist4=colptr (i+3) 
ist5=colptr (i+4) 
ist6=colptr (i+5) 

c *** Begin zero-checking option. 

if (b(i) .eq.0.0) then 
if (b(i+l) . eq. 0 . 0) then 
if (b(i+2) .eq.0.0) then 
if (b(i+3) . eq. 0 .0) then 
if (b(i+4) .eq.0.0) then 
if (b(i+5) . eq.0.0) then 
goto 11 
endif 
endif 
endif 
endif 
endif 
endif 

c *** End zero-checking option. 
b(i)=b(i) *a(istl) 
b(i+l)=(b(i+l)-a(ist 1+1) *b(i) 

+ )*a(ist2) 

b(i+2)=(b(i+2)-a(istl+2) *b(i) 


Figure A5. FORTRAN listing for subroutine vbsolve . 
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+ -a(ist2+l) *b(i+l) 

+ )*a(ist3) 

b(i+3)=(b(i+3)-a(istl+3) *b(i) 

+ -a(ist2+2)*b(i+l) 

+ -a(ist3+l)*b(i+2) 

+ )*a(ist4) 

b(i+4)=(b(i+4)-a(istl+4) *b(i) 

+ -a(ist2+3) *b(i+l) 

+ -a (1st 3+2) *b(i+2) 

+ -a(ist4+l) *b(i+3) 

+ )*a(ist5) 

b(i+5)=(b(i+5)-a(istl+5) *b(i) 

+ -a(ist2+4) *b(i+l) 

+ -a(ist3+3)*b(i+2) 

+ -a(ist4+2)*b(i+3) 

+ -a(ist5+l) *b(i+4) 

+ )*a(ist6) 


nopss=nopss+36 

istl=istl+6 

ist2=ist2+5 


ist3=iSt3+4 

ist4-ist4+3 

ist5=ist5+2 


ist6=ist6+l 


lth=collth(i)-6 


jrow=i+6 
cdir\$ ivdep 

do 10 ict=0, ltb-1 

b(jrow)=b(jrow) -a(ist 1+ict) *b(i) 

+ -a(ist2+ict) *b(i+l) 

+ -a(ist3+ict) *b(i+2) 

+ -a(ist4+ict) *b(i+3) 

+ -a(ist5+ict)*b(i+4) 

+ -a(ist6+ict) *b(i+5) 



jrow=jrow+l 

10 

continue 


nopss=nopss+12*lth 

11 

continue 


1 continue 


goto (100,101,102,103,104,105) , (left+1) 

105 continue 
i=n-4 

istl=colptr (i) 


Figure A5. Continued. 
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94 


b(i)=b(i)*a(istl) 

95 


b(i+l)=b(i+l)-a(istl+l)*b(i) 

96 


b(i+2)=b(i+2)-a(isti+2)*b(i) 

97 


b(i+3)=b(i+3)-a(istl+3)*b(i) 

98 


b(i+4)=b(i+4)-a(istl+4)*b(i) 

99 


nopss=nopss+9 

100 



101 

104 

continue 

102 


i=n-3 

103 


ist l=colptr (i) 

104 


b(i)=b(i) *a(istl) 

105 


b(i+l)=b(i+l)-a(istl+l) *b(i) 

106 


b(i+2)=b(i+2)-a(istl+2) *b(i) 

107 


b(i+3) =b(i+3) -a (ist 1+3) *b(i) 

108 


nopss=nopss+7 

109 



110 

103 

continue 

111 


i=n-2 

112 


istl=colptr(i) 

113 


b(i)=b(i)*a(istl) 

114 


b(i+l)=b(i+l)-a(istl+l)*b(i) 

115 


b(i+2)=b(i+2) -a(istl+2)*b(i) 

116 


nopss=nopss+5 

117 



118 

102 

continue 

119 


i=n-l 

120 


istl=colptr (i) 

121 


b(i)=b(i)*a(istl) 

122 


b(i+l)=b(i+l)-a(istl+l)*b(i) 

123 


nopss=nopss+3 

124 



125 

101 

continue 

126 


istl=colptr (n) 

127 


b(n)=b(n) *a(ist 1) 

128 


nopss=nopss+l 

129 



130 

100 

continue 

131 



132 

c *** 

Solve L(t)x=y 

133 



134 


goto (200,201,202,203,204,205) left+1 

135 



136 

205 

continue 

137 


istl=colptr (n) 

138 


ist2=colptr (n-1) 

139 


ist3=colptr (n-2) 


Figure A5. Continued. 
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ist4=colptr (n-3) 
ist5=colptr (n-4) 
b(n) =b(n) *a(istl) 
b(n-l)=(b(n-l)-a(ist2+l) *b(n) 

) *a(ist2) 

b (n-2) = (b (n-2) -a(ist3+2) *b (n) 
-a(ist3+l) *b(n-l) 
) *a(ist3) 

b (n-3) = (b (n-3) -a (ist4+3) *b (n) 
-a(ist4+2) *b(n-l) 
-a(ist4+l) *b(n-2) 
)*a(ist4) 

b(n-4)=(b(n-4) -a(ist5+4) *b(n) 
-a(ist5+3) *b(n-l) 
-a (i st 5+2) *b(n-2) 
-a(ist5+l) *b(n-3) 
)*a(ist5) 
nopss=nopss+25 
goto 200 


204 continue 

istl=colptr (n) 
ist2=colptr (n-1) 
ist3=colptr (n-2) 
ist4=colptr (n-3) 
b(n)=b(n) *a(istl) 
b(n-l)=(b(n-l)-a(ist2+l) *b(n) 

+ )*a(ist2) 

b(n-2)=(b(n-2) -a(ist3+2) *b(n) 

+ -a(ist3+l) *b(n-l) 

+ )*a(ist3) 

b(n-3)=(b(n-3) -a(ist4+3) *b(n) 

+ -a(ist4+2) *b(n-l) 

+ -a(ist4+l) *b(n-2) 

+ ) *a(ist4) 

nopss=nopss+16 
goto 200 


203 continue 

istl=colptr (n) 
ist2=colptr (n-1) 
ist3=colptr (n-2) 
b(n)=b(n) *a(istl) 
b(n-l)=(b(n-l) -a(ist2+l) *b(n) 

+ )*a(ist2) 

b (n-2) = (b (n-2) -a(ist3+2) *b (n) 

+ -a(ist3+l) *b(n-l) 

+ )*a(ist3) 


Figure A5. Continued. 
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nopss=nopss+9 
goto 200 

202 continue 

istl=colptr(n) 
ist2=colptr (n-1) 
b(n)=b(n) *a(istl) 
b(n-l)=(b(n-l)-a(ist2+l)*b(n) 

+ )*a(ist2) 

nopss=nopss+4 
goto 200 

201 continue 

istl=colptr (n) 
b(n)=b(n) *a(istl) 
nopss=nopss+l 

200 continue 

do 2 i=last , 1,-6 
lth=collth(i)-l 
istl=colptr (i-5) +5 
ist2=colptr (i-4)+4 
ist3=colptr (i-3)+3 
ist4=colptr (i-2)+2 
ist5=colptr(i-l)+l 
ist6=colptr (i) 
if (lth.eq.0) goto 21 
cdir\$ ivdep 

do 20 ict=l,lth 

b(i)=b(i)-a(ist6+ict)*b(i+ict) 
b(i-l)=b(i-l)-a(ist5+ict) *b(i+ict) 
b(i-2)=b(i-2)-a(ist4+ict) *b(i+ict) 
b(i-3)=b(i-3)-a(ist3+ict) *b(i+ict) 
b(i-4)=b(i-4)-a(ist2+ict) *b(i+ict) 
b(i-5)=b(i-5)-a(istl+ict) *b(i+ict) 

20 continue 
nopss=nopss+12*lth 

21 continue 
b(i)=b(i)*a(ist6) 
b(i-l)=(b(i-l) -a(ist5) *b(i) 

+ )*a(ist5-l) 

b(i-2)=(b(i-2)-a(ist4)*b(i) 

+ -a(ist4-l) *b(i-i) 

+ )*a(ist4-2) 

b(i-3)=(b(i-3) -a(ist3) *b(i) 


Figure A5. Continued. 



234 

+ 

-a(ist3-l) *b(i-l) 

235 

+ 

-a(ist3-2) *b(i-2) 

236 

+ 

)*a(ist3-3) 

237 


b(i-4) = (b (i-4)-a(ist2) *b(i) 

238 

+ 

-a(ist2-l) *b(i-l) 

239 

+ 

-a (1st 2-2) *b(i-2) 

240 

+ 

-a(ist2-3)*b(i-3) 

241 

+ 

) *a(ist2-4) 

242 


b(i-5)=(b(i-5)-a(istl)*b(i) 

243 

+ 

-a(isti-l)*b(i-l) 

244 

+ 

-a(istl-2)*b(i-2) 

245 

+ 

-a(istl-3) *b(i-3) 

246 

+ 

-a(istl-4) *b(i-4) 

247 

+ 

)*a(istl-5) 

248 


nopss=nopss+36 

249 

2 

continue 

250 



251 


return 

252 


end 


Figure A5. Concluded. 
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